Project Objectives

This project had three principal aims:

1.  Improving  the  scalability  and  efficiency  of  “Ultra-scale”  methods  for  grid-based  solutions  to 
time-dependent  PDEs; 

2.  Sparse  storage  and  reconstruction  of  information; 

3.  Building in several levels of resiliency to handle various hard faults in the system.

Progress was made in all three areas, leading to fifteen published refereed articles, five articles
in review, one completed master's thesis, and one doctoral thesis in progress. Broadly, PI Ong
and his research team worked primarily on objectives 1 and 3, developing parallel-in-time methods and
domain decomposition transmission conditions to improve the scalability and resiliency of computations.
PI Christlieb, PI Wang, and their respective research teams worked primarily on objective
2, developing sparse FFT algorithms and tackling the phase retrieval problem.

Personnel 

This project supported three faculty, three postdoctoral fellows, and three graduate students.
Mr. High completed his M.Sc. in the Department of Mathematics at MSU under the supervision of
Dr. Ong, and is now pursuing a Ph.D. in computational science at UIUC. Mr. Ala Alzaalig is still
working on his doctoral degree at Michigan Technological University.

Faculty  Supported: 

•  PI:  Dr.  Andrew  J.  Christlieb  (2012-2015) 

•  PI:  Dr.  Benjamin  W.  Ong  (2012-2015) 

•  PI:  Dr.  Yang  Wang  (2012  -  2014) 

Postdoctoral  Scholars  Supported: 

•  Dr.  Yang  Liu  (2012  -  2014) 

•  Dr.  Ke  Wang  (2013  -  2014) 

•  Dr.  Bankim  Mandal  (2014  -  2015) 

Graduate  Students  supported: 

•  Mr.  Scott  High  (2013,  graduated  with  M.Sc) 

•  Mr.  Ethan  Novak  (graduate  research  project,  Summer  2015) 

•  Mr.  Ala  Alzaalig  (PhD  in  progress) 
Scientific Workshops/Conferences

In addition to the AFOSR Computational Mathematics annual review, results from research related
to this project were disseminated at the following workshops/conferences. This list does not include
departmental seminars/colloquia at various universities.

1.  “The  Phase  Retrieval  Problem”,  IPAM  Workshop  on  Adaptive  Data  Analysis,  Los  Angeles, 
CA,  2012 

2.  “Minimal  frames  for  Phase  Retrieval”,  Workshop  on  Phaseless  Reconstruction,  FFT  2013, 
College  Park,  MD,  2013 

3.  “Robust  Sub-Linear  Time  Fourier  Algorithms”  SIAM  Conference  on  Computational  Science 
and  Engineering,  Boston,  MA,  2013 

4.  “An Optimized RIDC-DD Space-time Method for Time Dependent Partial Differential Equations”,
SIAM Conference on Computational Science and Engineering, Boston, MA, 2013

5.  “The  Phase  Retrieval  Problem”,  International  Conference  on  Approximation  Theory  and 
Applications,  Hong  Kong,  2013 

6.  “Mathematical Investigation of Authorship Attribution: A Case Study”, New Trends in
Applied Harmonic Analysis, CIMPA 2013, Mar del Plata, Argentina, 2013

7.  “Pipeline  Schwarz  Waveform  Relaxation”,  Domain  Decomposition  22,  Lugano,  Switzerland, 
2013 

8.  “The Phase Retrieval Problem”, Workshop on Applied Harmonic Analysis and Approximation
Theory, Guangzhou, China, 2014

9.  “A Robust and Efficient Phase Retrieval Algorithm”, 5th International Conference on Scientific
Computing and Partial Differential Equations, Hong Kong, 2014

10.  “Fast  Phase  Retrieval  for  High  Dimensions”,  AMS  Spring  Sectional  Meeting,  East  Lansing, 
2015 

11.  “Sub-Linear  Sparse  Fourier  Algorithm  for  High  Dimensional  Data”,  SIAM  Annual  Meeting 
2014,  Chicago,  IL,  2015 

12.  “RIDC methods with stepsize control”, 4th Workshop on Parallel-in-Time Integration, Dresden,
Germany, 2015
Summary of Results

1.  Fast  phase  retrieval  for  high-dimensions 
M.  Iwen,  A.  Viswanathan,  Y.  Wang 
eprint arXiv:1501.02377

Description: A fast phase retrieval method which is near-linear time, making it computationally
feasible for large dimensional signals. Both theoretical and experimental results
demonstrate the method’s speed, accuracy, and robustness. We then use this new phase retrieval
method to help establish the first known sublinear-time compressive phase retrieval
algorithm capable of recovering a given $s$-sparse vector $x \in \mathbb{C}^d$ (up to an unknown phase
factor) in just $O(s \log^5 s \cdot \log d)$-time using only $O(s \log^4 s \cdot \log d)$ magnitude measurements.

2.  Detection  of  edges  from  two-dimensional  Fourier  data  using  Gaussian  modifiers 
A.  Gelb,  G.  Song,  A.  Viswanathan  and  Y.  Wang 

eprint 

Description:  The  detection  of  edges  from  two-dimensional  truncated  Fourier  data  is  studied. 
Compared  to  edge  detection  from  pixel  data,  this  is  a  more  challenging  problem  since  we  seek 
accurate  local  information  from  a  small  number  of  often  noisy  global  measurements.  Here  we 
develop  a  highly  effective  algorithm  using  a  specific  class  of  spectral  modifiers  which  converges 
uniformly  to  sharp  peaks  along  the  singular  support  of  the  function. 

3.  Random  matrices  and  erasure-robust  frames 
Y.  Wang 

eprint arXiv:1403.5969

Description: Data erasure and robustness are important considerations for building redundant
systems (frames). Can one build a system (frame) which is robust against more than
50% data erasures? This was the conjectured upper bound within the community. This paper
shows that there is in fact no such upper bound: random Gaussian frames can be
robust against an arbitrarily high percentage of data erasures.

4.  On  the  decay  of  the  smallest  singular  value  of  submatrices  of  rectangular  matrices 
Y.  Liu  and  Y.  Wang 

Description: The main contribution of this paper is to show the connection between the
singular value problem and a combinatorial geometry problem. Using a technique from integral
geometry and from the perspective of combinatorial geometry, we show that the smallest
singular value of submatrices is related to the minimal distance of points to the lines connecting
two other points in a bounded point set. The decay rate of the minimal distance for the
set of points can then be estimated.

5.  A distributed and incremental SVD algorithm for agglomerative data analysis on large networks

M. A. Iwen and B. W. Ong
eprint arXiv:1601.07010

Description: In this paper, an algorithm is formulated to compute the singular value decomposition
of highly rectangular, distributed matrices efficiently using a hierarchical approach.
The algorithm is proven to recover the exact decomposition if the rank of the input
matrix is known a priori. Additionally, the algorithm can be used to recover the $d$ largest
singular vectors with bounded error. The algorithm is shown to be stable with respect to
roundoff errors or corruption of the original matrix entries.


6.  Robust  sparse  phase  retrieval  made  easy 
M.  Iwen,  A.  Viswanathan  and  Y.  Wang 
Applied  and  Computational  Harmonic  Analysis 
to  appear 

doi:  10.1016/j.acha.2015.06.007 

Description: In this paper we develop a two-stage algorithm for phase retrieval of sparse
vectors. It is fast and robust, and it requires the optimally small number of measurements.
Our algorithm also settles a conjecture on the
number of measurements needed to perform phase retrieval for complex signals.

7.  A  multiscale  sub-linear  time  Fourier  algorithm  for  noisy  data 

A.  J.  Christlieb  and  D.J.  Lawlor  and  Y.  Wang 
Applied  and  Computational  Harmonic  Analysis 
to  appear 

doi:  10.1016/j.acha.2015.04.002 

Description: The sparse Fourier algorithm for noiseless signals is extended to the noisy
setting. We present two such extensions, the second of which exhibits a novel form of
error-correction not unlike that of the $\beta$-encoders in analog-to-digital conversion. The algorithm
runs in time $O(k \log(k) \log(N/k))$ on average, provided the noise is not overwhelming. The
error-correction property allows the algorithm to outperform FFTW over a wide range of
sparsity and noise values, and is to the best of our knowledge novel in the sparse Fourier
transform context.

8.  Pipeline  Schwarz  Waveform  Relaxation 

B.  W.  Ong,  S.  High  and  F.  Kwok 

Lecture  Notes  in  Computational  Science  and  Engineering,  Domain  Decomposition  Methods 
in  Science  and  Engineering  XXII 
to  appear 

Description: Schwarz Waveform Relaxation methods are reformulated to allow for pipeline
parallelization. This increases the scalability of the waveform relaxation algorithms while
maintaining high efficiency.

9.  Algorithm  xxx  -  a  family  of  parallel  time  integrators 
B.  W.  Ong,  R.  D.  Haynes  and  K.  Ladd 

ACM  Transactions  on  Mathematical  Software 
to  appear 

Description: The Revisionist Integral Deferred Correction software, a parallel-in-time integrator,
is able to bootstrap lower order time integrators to provide high-order approximations
in approximately the same wall clock time. The user-supplied time step routine may be explicit
or implicit and may make use of any auxiliary libraries which take care of the solution
of any nonlinear algebraic systems which may arise.

10.  Stable  signal  recovery  from  phaseless  measurements 
B.  Gao,  Y.  Wang  and  Z.  Wu 
Journal  of  Fourier  Analysis  and  Applications 
(2015) pp. 1-21

doi: 10.1007/s00041-015-9434-x

Description: This paper studies the stability of $\ell_1$ minimization for compressive
phase retrieval and extends the instance-optimality of compressed sensing to the real phase
retrieval setting. We first show that $m = O(k \log(N/k))$ measurements are enough to
guarantee that $\ell_1$ minimization recovers $k$-sparse signals stably, provided the measurement
matrix $A$ satisfies the strong RIP property. We use the results to build a parallel between
compressive phase retrieval and classical compressive sensing.

11.  Gabor  orthonormal  bases  generated  by  the  unit  cubes 
J.-P.  Gabardo,  C.-K.  Lai  and  Y.  Wang 

Journal  of  Functional  Analysis 
Vol.  269  (2015),  pp  1515-1538 

Description:  This  paper  studies  Gabor  orthonormal  bases  generated  by  the  characteristic 
functions  of  a  unit  cube.  A  complete  characterization  is  given. 

12.  Probabilistic  Estimates  of  the  Largest  Strictly  Convex  Singular  Values  of  Pregaussian  Random 
Matrices 

Y.  Liu 

Journal  of  Mathematics  and  Statistics  (2015) 

Description: The $p$-singular values of random matrices with Gaussian entries, defined in
terms of the $\ell_p$-norm for $p > 1$, are studied.

13.  The probabilistic estimates on the largest and smallest $q$-singular values of random matrices
M.-J. Lai and Y. Liu

Mathematics of Computation
84:294 (2015), pp. 1775-1794
doi: 10.1090/S0025-5718-2014-02895-0

Description: In this paper, the $q$-singular values of random matrices with pregaussian entries
in the case $0 < q \le 1$ are studied. The main results are decay estimates on the lower and
upper tail probabilities of the $q$-singular values. The $k$-th $q$-singular value of an $m \times n$ matrix
$A$ is defined by

$$ s_k^{(q)}(A) := \inf_{V} \sup_{x \in V \setminus \{0\}} \frac{\|Ax\|_q}{\|x\|_q}, $$

where $\|\cdot\|_q$ denotes the $\ell_q$-quasinorm ($q > 0$) and the inf is taken over all linear subspaces
$V \subseteq \mathbb{R}^n$ of dimension at least $n - k + 1$.

14.  A new approach for analyzing physiological time series
D. Mao, Y. Wang, and Q. Wu
Advances in Adaptive Data Analysis
(2015), pp 1550001
doi: 10.1142/S1793536915500016

Description: We developed a new approach for the analysis of physiological time series for
the purpose of detection and classification. An iterative convolution filter is used to decompose
the time series into various components. Statistics of these components are extracted as
features to characterize the mechanisms underlying the time series. Motivated by studies
showing that many normal physiological systems involve irregularity, while a decrease of
irregularity usually implies abnormality, the statistics for “outliers” in the components
are used as features measuring irregularity. Support vector machines are used to select the
most relevant features that are able to differentiate the time series from normal and abnormal
systems. This new approach is successfully used in the study of congestive heart failure by
heart beat interval time series.

15.  Multiple  authors  detection:  a  quantitative  analysis  of  Dream  of  the  Red  Chamber 

X.  Hu,  Y.  Wang  and  Q.  Wu 
Advances  in  Adaptive  Data  Analysis 
Vol  6,  Issue  4  (2014),  pp  1450012 
doi:  10.1142/S1793536914500125 

Description: We develop a robust method based on machine learning as well as an effective
set of features for the detection of multiple authorship within a book. We apply our method
to the historic authorship controversy to show that the commonly read version of Dream of
the Red Chamber, one of the greatest novels in Chinese literature, must have been written by two
authors as suspected.

16.  Invertibility  and  robustness  of  phaseless  reconstruction 
R.  Balan  and  Y.  Wang 

Applied  and  Computational  Harmonic  Analysis 
Vol  38  (2015),  pp.  469-488 
doi: 10.1016/j.acha.2014.07.003

Description:  This  paper  is  concerned  with  the  question  of  reconstructing  a  vector  in  a 
finite-dimensional  real  Hilbert  space  when  only  the  magnitudes  of  the  coefficients  of  the 
vector  under  a  redundant  linear  map  are  known.  We  analyze  various  Lipschitz  bounds  of  the 
nonlinear  analysis  map  and  we  establish  theoretical  performance  bounds  of  any  reconstruction 
algorithm. We show that robust and stable reconstruction requires more redundancy
than the critical threshold.

17.  Revisionist  integral  deferred  correction  with  adaptive  stepsize  control 
A.  Christlieb,  C.  Macdonald,  B.  Ong  and  R.  Spiteri 
Communications  in  Applied  Mathematics  and  Computational  Science 
Vol  10,  Number  1  (2015),  pp.  1-25 

doi: 10.2140/camcos.2015.10.1

Description: This paper builds stepsize control into the revisionist integral deferred correction
framework. Three variants are explored. In the most successful variant, the prediction
level is used for step-size control.

18.  Phase  retrieval  for  sparse  signals 

Y.  Wang  and  Z.  Xu 

Applied  and  Computational  Harmonic  Analysis 
Vol  37  (2014),  pp.  531-544 
doi:  10.1016/j.acha.2014.04.001 

Description:  In  this  paper  we  provide  a  theoretical  foundation  for  sparse  signal  phase 
retrieval.  We  build  a  parallel  frame  for  sparse  signal  phase  retrieval  that  is  analogous  to 
the theoretical framework for compressive sensing. In particular, we extend the RIP property
and  the  Null  Space  property  from  compressive  sensing  to  sparse  phase  retrieval. 

19.  Phase  retrieval  from  very  few  measurements 
M.  Fickus,  D.  Mixon,  A.  Nelson  and  Y.  Wang 
FAST PHASE RETRIEVAL FOR HIGH-DIMENSIONS


MARK IWEN, ADITYA VISWANATHAN, AND YANG WANG


Abstract.  We develop a fast phase retrieval method which is near-linear time, making it computationally
feasible for large dimensional signals. Both theoretical and experimental results demonstrate
the method’s speed, accuracy, and robustness. We then use this new phase retrieval method
to help establish the first known sublinear-time compressive phase retrieval algorithm capable of
recovering a given $s$-sparse vector $x \in \mathbb{C}^d$ (up to an unknown phase factor) in just $O(s \log^5 s \cdot \log d)$-time
using only $O(s \log^4 s \cdot \log d)$ magnitude measurements.


1.  Introduction 

We consider the phase retrieval problem of recovering a given vector $x \in \mathbb{C}^d$, up to an unknown
global phase factor, from a set of squared magnitude measurements $|Mx|^2 \in \mathbb{R}^D$, with $D \ge d$.
Here $M \in \mathbb{C}^{D \times d}$, and $|\cdot|^2 : \mathbb{C}^D \to \mathbb{R}^D$ computes the componentwise squared magnitude of each
vector entry. Our objective is to design a computationally efficient recovery method, $\mathcal{A} : \mathbb{R}^D \to \mathbb{C}^d$,
which can approximately recover $x$ using the magnitude measurements $|Mx|^2$ that result from any
member of a relatively large class of matrices $M \in \mathbb{C}^{D \times d}$. More specifically, we require that

(1)    $\mathcal{A}\left(|Mx|^2\right) \approx e^{-i\theta} x$

for some unknown $\theta \in [0, 2\pi]$.
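The reason recovery is only possible up to a global phase factor can be checked numerically. The sketch below is illustrative only: the Gaussian matrix `M`, the sizes `d` and `D`, and the helper name `squared_magnitudes` are assumptions made for the example, not the measurement design used in this paper.

```python
import numpy as np

def squared_magnitudes(M, x):
    """Componentwise squared magnitudes |Mx|^2 (a real vector of length D)."""
    return np.abs(M @ x) ** 2

rng = np.random.default_rng(0)
d, D = 8, 24  # hypothetical sizes with D >= d

# Hypothetical complex Gaussian measurement matrix and signal.
M = rng.standard_normal((D, d)) + 1j * rng.standard_normal((D, d))
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)

# Rotating x by any global phase leaves the measurements unchanged,
# so x and exp(i*theta) * x cannot be told apart from |Mx|^2 alone.
theta = 1.234
b = squared_magnitudes(M, x)
b_rot = squared_magnitudes(M, np.exp(1j * theta) * x)
print(np.allclose(b, b_rot))
```

Since every entry of $M(e^{i\theta}x)$ is the corresponding entry of $Mx$ multiplied by a unit-modulus scalar, the two measurement vectors agree for every choice of $M$ and $\theta$.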

Phase  retrieval  problems  arise  in  many  crystallography  and  optics  applications  (see,  e.g.,  [40, 
30,  21,  29]).  As  a  result,  phase  retrieval  has  been  studied  a  great  deal  over  the  past  decade  within 
the  applied  mathematics  community.  The  majority  of  this  work  has  focussed  on  establishing  upper 
and  lower  bounds  for  the  number  of  magnitude  measurements  required  for  reconstructing  x  up  to 
a global phase factor. It has been shown, e.g., that $O(d)$ magnitude measurements suffice for phase
retrieval of both real and complex vectors $x \in \mathbb{C}^d$ [3, 6, 17]. Furthermore, it is also known that
$O(d)$ magnitude measurements are required [22].

There  has  also  been  a  good  deal  of  work  done  developing  phase  retrieval  algorithms  which  are 
(i)  computationally  efficient,  (ii)  robust  to  measurement  noise,  and  (iii)  theoretically  guaranteed  to 
reconstruct  a  given  vector  up  to  a  global  phase  error  using  a  near-minimal  number  of  magnitude 
measurements. For example, it has been shown that robust phase retrieval is possible with $D = O(d)$
magnitude measurements by solving a semidefinite programming relaxation of it as a rank-1
matrix recovery problem [12, 11]. This allows polynomial-time convex optimization methods to
be used for phase retrieval. Furthermore, the runtimes of these convexity-based methods can be
reduced with the use of $O(d \log d)$ magnitude measurements [14]. Other phase retrieval approaches
include the use of spectral recovery methods together with magnitude measurement ensembles
inspired by expander graphs [2]. These methods allow the recovery of $x$ up to a global phase factor
using $O(d)$ magnitude measurements, and run in $\Omega(d^2)$-time in general.$^1$ All of these approaches
utilize magnitude measurements $|Mx|^2$ resulting from either (i) Gaussian random matrices $M$, or
(ii) unbalanced expander graph constructions, in order to prove their recovery guarantees.

In this paper we demonstrate that a relatively general class of invertible block circulant measurement
matrices $M \in \mathbb{C}^{D \times d}$ results in $D = O(d \log^c d)$ magnitude measurements, $|Mx|^2$, which
allow for phase retrieval in just $O(d \log^c d)$-time.$^2$ In particular, we construct a well-conditioned set
of Fourier-based measurements, $M \in \mathbb{C}^{D \times d}$, which are theoretically guaranteed to allow for
the phase retrieval of a given vector with high probability in $O(d \log^4 d)$-time. These measurements
are of particular interest given that they are closely related to short-time Fourier transform based
measurements, which are of special significance in several application areas (see, e.g., [15] and the
references therein). Numerical experiments both verify the speed and accuracy of the proposed
phase retrieval approach, and indicate that the approach is highly robust to measurement
noise. Finally, after establishing and analyzing our general phase retrieval method, we utilize
it to establish the first known sublinear-time compressive phase retrieval method capable
of recovering $s$-sparse vectors $x$ (up to an unknown phase factor) in only $O(s \log^c d)$-time.

The remainder of this paper is organized as follows: In section 2 we establish notation and
discuss important preliminary results. Next, in section 3, we present our general phase retrieval
algorithm and discuss its runtime complexity. We then analyze our phase retrieval algorithm
and prove recovery guarantees for specific types of Fourier-based measurement matrices in section 4.
In section 5, we empirically evaluate the proposed phase retrieval method for speed and robustness.
Finally, in section 6, we use our general phase retrieval algorithm in order to construct a sublinear-time
compressive phase retrieval method which is guaranteed to recover sparse vectors (up to an
unknown phase factor) in near-optimal time.


2.  Preliminaries:  Notation  and  Setup 


For any matrix $X \in \mathbb{C}^{D \times d}$ we will denote the $j$th column of $X$ by $X_j \in \mathbb{C}^D$. The conjugate
transpose of a matrix $X \in \mathbb{C}^{D \times d}$ will be denoted by $X^* \in \mathbb{C}^{d \times D}$, and the singular values of any
matrix $X \in \mathbb{C}^{D \times d}$ will always be ordered as $\sigma_1(X) \ge \sigma_2(X) \ge \cdots \ge \sigma_{\min(D,d)}(X) \ge 0$. Also, the
condition number of the matrix $X$ will be denoted by $\kappa(X) := \sigma_1(X)/\sigma_{\min(D,d)}(X)$. We will use the
notation $[n] := \{1, \dots, n\} \subset \mathbb{N}$ for any $n \in \mathbb{N}$. Finally, given any $x \in \mathbb{C}^d$, the vector $x_s^{\rm opt} \in \mathbb{C}^d$ will
always denote an optimal $s$-sparse approximation to $x$. That is, it preserves the $s$ largest entries in
magnitude of $x$ while setting the rest of the entries to 0. Note that $x_s^{\rm opt} \in \mathbb{C}^d$ may not be unique
as there can be ties for the $s$th largest entry in magnitude.
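As a concrete illustration of the optimal $s$-sparse approximation just defined, the following minimal sketch keeps the $s$ largest-magnitude entries of a vector and zeroes the rest. The function name `best_s_sparse` is hypothetical, and ties are broken by whatever order the stable sort produces, which reflects the non-uniqueness noted above.

```python
import numpy as np

def best_s_sparse(x, s):
    """Return an optimal s-sparse approximation of x.

    Keeps the s entries of largest magnitude and sets all others to 0.
    If several entries tie for the s-th largest magnitude, one valid
    choice is returned (the approximation is then not unique).
    """
    x = np.asarray(x)
    out = np.zeros_like(x)
    if s <= 0:
        return out
    keep = np.argsort(np.abs(x), kind="stable")[-s:]  # indices of the s largest |x_j|
    out[keep] = x[keep]
    return out

x = np.array([3 + 4j, 0.1, -2.0, 1j, 0.5])
x2 = best_s_sparse(x, 2)  # keeps the entries 3+4j and -2
print(x2)
```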

Hereafter we will assume that our measurement matrix $M \in \mathbb{C}^{D \times d}$ has $D := (2\delta - 1)d$ rows,
for a user specified value of $\delta \in \mathbb{N}$. Furthermore, we utilize the obvious decomposition of $M$ into
$(2\delta - 1)$ blocks, $M_1, \dots, M_{2\delta-1} \in \mathbb{C}^{d \times d}$, given by

(2)    $M = \begin{pmatrix} M_1 \\ M_2 \\ \vdots \\ M_{2\delta-1} \end{pmatrix}.$

Each $M_l \in \mathbb{C}^{d \times d}$ is itself assumed to be both circulant, with

(3)    $(M_l)_{i,j} := (m_l)_{(j-i) \bmod d + 1}$

for some $m_l \in \mathbb{C}^d$, and banded, so that $(m_l)_i = 0$ for all $i > \delta$, and $1 \le l \le 2\delta - 1$.$^3$


$^1$ Their runtime complexity is dominated by the time required to solve an overdetermined linear system.

$^2$ Herein $c$ is a fixed absolute constant.

$^3$ All indexes of vectors in $\mathbb{C}^d$ will automatically be considered modulo $d$, $+1$, in this fashion hereafter.
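As a consequence, the squared magnitude measurements from the $l$th block, $|M_l x|^2 \in \mathbb{R}^d$, can
be rewritten as

(4)    $(|M_l x|^2)_i = |(M_l x)_i|^2 = \sum_{j,k=1}^{\delta} (m_l)_j \overline{(m_l)_k}\, x_{j+i-1} \overline{x_{k+i-1}}.$

Let $y \in \mathbb{C}^D$ be defined by

(5)    $y_i := x_{\left\lceil \frac{i+\delta-1}{2\delta-1} \right\rceil}\, \overline{x}_{\left\lceil \frac{i+\delta-1}{2\delta-1} \right\rceil + ((i+\delta-2) \bmod (2\delta-1)) - \delta + 1}.$

Furthermore, let $0_\alpha \in \mathbb{R}^{1 \times \alpha}$ be the row vector of $\alpha$ zeros for any given $\alpha \in \mathbb{N}$, and let $m_{(l,j)} \in \mathbb{C}^{1 \times \delta}$
be such that

(6)    $(m_{(l,j)})_k := (m_l)_j \overline{(m_l)_k}.$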

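We can now re-express $|M_l x|^2 \in \mathbb{R}^d$ from (4) as $\tilde{M}_l y$, where $\tilde{M}_l \in \mathbb{C}^{d \times D}$ is a $(2\delta - 1)$-circulant
matrix defined by

[Matrix display not recoverable from the source scan. Its legible fragments show entries of the row vectors $m_{(l,1)}, \dots, m_{(l,\delta)}$ separated by zero blocks such as $0_{\delta-2}$, with each row shifted $2\delta - 1$ columns to the right of the row above it.]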


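Finally, after reordering the entries of $|Mx|^2$ via a permutation matrix $P \in \{0,1\}^{D \times D}$, we arrive
at our final form

(7)    $P|Mx|^2 = M'y = \begin{pmatrix} M'_1 & M'_2 & \cdots & M'_\delta & 0 & 0 & \cdots & 0 \\ 0 & M'_1 & M'_2 & \cdots & M'_\delta & 0 & \cdots & 0 \\ \vdots & & \ddots & & & & \ddots & \vdots \\ M'_2 & \cdots & M'_\delta & 0 & 0 & \cdots & 0 & M'_1 \end{pmatrix} y.$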

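Here $M' \in \mathbb{C}^{D \times D}$ is a block circulant matrix [38] whose blocks, $M'_1, \dots, M'_\delta \in \mathbb{C}^{(2\delta-1) \times (2\delta-1)}$, have
entries

(8)    $(M'_l)_{i,j} := \begin{cases} (m_i)_l \overline{(m_i)_{j+l-1}} & \text{if } 1 \le j \le \delta - l + 1, \\ 0 & \text{if } \delta - l + 2 \le j \le 2\delta - l - 1, \\ (m_i)_{l+1} \overline{(m_i)_{l+j-2\delta+1}} & \text{if } 2\delta - l \le j \le 2\delta - 1, \text{ and } l < \delta, \\ 0 & \text{if } j > 1, \text{ and } l = \delta. \end{cases}$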


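Let $I_\alpha$ denote the $\alpha \times \alpha$ identity matrix. We now note that $M'$ can be block diagonalized via
the unitary block Fourier matrices $U_\alpha \in \mathbb{C}^{\alpha d \times \alpha d}$, with parameter $\alpha \in \mathbb{N}$, defined by

(9)    $U_\alpha := \frac{1}{\sqrt{d}} \begin{pmatrix} I_\alpha & I_\alpha & \cdots & I_\alpha \\ I_\alpha & I_\alpha e^{\frac{2\pi i}{d}} & \cdots & I_\alpha e^{\frac{2\pi i (d-1)}{d}} \\ \vdots & \vdots & & \vdots \\ I_\alpha & I_\alpha e^{\frac{2\pi i (d-2)}{d}} & \cdots & I_\alpha e^{\frac{2\pi i (d-2)(d-1)}{d}} \\ I_\alpha & I_\alpha e^{\frac{2\pi i (d-1)}{d}} & \cdots & I_\alpha e^{\frac{2\pi i (d-1)(d-1)}{d}} \end{pmatrix}.$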
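More precisely, one can see that we have

(10)    $U_{2\delta-1}^* \, M' \, U_{2\delta-1} = J := \begin{pmatrix} J_1 & 0 & 0 & \cdots & 0 \\ 0 & J_2 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & 0 & J_{d-1} & 0 \\ 0 & 0 & 0 & 0 & J_d \end{pmatrix}$

where $J \in \mathbb{C}^{D \times D}$ is block diagonal with blocks $J_1, \dots, J_d \in \mathbb{C}^{(2\delta-1) \times (2\delta-1)}$ given by

(11)    $J_k := \sum_{l=1}^{\delta} M'_l \cdot e^{\frac{2\pi i (k-1)(l-1)}{d}}.$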

Not  so  surprisingly,  the  fact  that  any  block  circulant  matrix  can  be  block  diagonalized  by  block 
Fourier  matrices  will  lead  to  more  efficient  computational  techniques  below. 
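The block diagonalization of a block circulant matrix by block Fourier matrices can be verified numerically on a small instance. The sketch below is a sanity check under stated assumptions, not the authors' code: the blocks are random stand-ins for the actual measurement-dependent blocks, indices are 0-based, the Fourier kernel is taken as $e^{+2\pi i rc/d}/\sqrt{d}$, and the helper names `block_circulant` and `block_fourier` are hypothetical.

```python
import numpy as np

def block_circulant(blocks, d):
    """Dense block circulant matrix whose block (r, c) is blocks[(c - r) mod d],
    with zero blocks wherever that index is >= len(blocks)."""
    a = blocks[0].shape[0]
    out = np.zeros((a * d, a * d), dtype=complex)
    for r in range(d):
        for l, B in enumerate(blocks):
            c = (r + l) % d
            out[r * a:(r + 1) * a, c * a:(c + 1) * a] = B
    return out

def block_fourier(a, d):
    """Unitary block Fourier matrix: block (r, c) is I_a * exp(2*pi*i*r*c/d) / sqrt(d)."""
    k = np.arange(d)
    F = np.exp(2j * np.pi * np.outer(k, k) / d) / np.sqrt(d)
    return np.kron(F, np.eye(a))

rng = np.random.default_rng(0)
d, delta = 6, 3          # hypothetical small sizes
a = 2 * delta - 1        # block size
blocks = [rng.standard_normal((a, a)) + 1j * rng.standard_normal((a, a))
          for _ in range(delta)]

Mp = block_circulant(blocks, d)
U = block_fourier(a, d)
J = U.conj().T @ Mp @ U

# Expected result: block diagonal, with k-th diagonal block
# sum_l blocks[l] * exp(2*pi*i*k*l/d) (0-based k and l).
J_expected = np.zeros_like(J)
for k in range(d):
    Jk = sum(B * np.exp(2j * np.pi * k * l / d) for l, B in enumerate(blocks))
    J_expected[k * a:(k + 1) * a, k * a:(k + 1) * a] = Jk

max_err = np.max(np.abs(J - J_expected))
print(max_err)
```

In this check the printed deviation should sit at the level of floating-point roundoff, which is the numerical counterpart of the statement that the transformed matrix has nonzero blocks only on its diagonal.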

2.1.  Johnson-Lindenstrauss Embeddings and Restricted Isometries.  Below we will utilize
results concerning Johnson-Lindenstrauss embeddings [26, 19, 1, 13, 4, 27] of a given finite set
$S \subset \mathbb{C}^d$ into $\mathbb{C}^m$ for $m < d$. These are defined as follows:

Definition 1.  Let $\epsilon \in (0,1)$, and $S \subset \mathbb{C}^d$ be finite. An $m \times d$ matrix $A$ is a linear
Johnson-Lindenstrauss embedding of $S$ into $\mathbb{C}^m$ if

$(1 - \epsilon)\|u - v\|_2^2 \le \|Au - Av\|_2^2 \le (1 + \epsilon)\|u - v\|_2^2$

holds $\forall u, v \in S \cup \{0\}$. In this case we will say that $A$ is a JL($m$,$d$,$\epsilon$)-embedding of $S$ into $\mathbb{C}^m$.

Linear JL($m$,$d$,$\epsilon$)-embeddings are closely related to the Restricted Isometry Property [9, 4, 18].

Definition 2.  Let $s \in [d]$ and $\epsilon \in (0,1)$. The matrix $A \in \mathbb{C}^{m \times d}$ has the Restricted Isometry
Property if

(12)    $(1 - \epsilon)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \epsilon)\|x\|_2^2$

holds $\forall x \in \mathbb{C}^d$ containing at most $s$ nonzero coordinates. In this case we will say that $A$ is RIP($s$,$\epsilon$).

In  particular,  the  following  theorem  due  to  Krahmer  and  Ward  [27,  18]  demonstrates  that  a  matrix 
with  the  restricted  isometry  property  can  be  used  to  construct  a  Johnson-Lindenstrauss  embedding 
matrix. 

Theorem 1.  Let $S \subset \mathbb{C}^d$ be a finite point set with $|S| = M$. For $\epsilon, p \in (0,1)$, let $A \in \mathbb{C}^{m \times d}$ be
RIP($2s$,$\epsilon/C_1$) for some $s \ge C_2 \cdot \ln(4M/p)$.$^4$ Finally, let $B \in \{-1, 0, 1\}^{d \times d}$ be a random diagonal
matrix with independent and identically distributed (i.i.d.) symmetric Bernoulli entries on its
diagonal. Then, $AB$ is a JL($m$,$d$,$\epsilon$)-embedding of $S$ into $\mathbb{C}^m$ with probability at least $1 - p$.

Below we will utilize Theorem 1 together with a result concerning the restricted isometry property
for sub-matrices of a Fourier matrix. Let $F \in \mathbb{C}^{d \times d}$ be the unitary $d \times d$ discrete Fourier transform
matrix. The random sampling matrix, $R' \in \mathbb{C}^{m \times d}$, for $F$ is then


where $R \in \{0,1\}^{m \times d}$ is a random matrix with exactly one nonzero entry per row (i.e., each entry’s
column position is drawn independently from $[d]$ uniformly at random with replacement). The
following theorem is proven in [18].$^5$

$^4$ Here $C_1, C_2 \in (1, \infty)$ are both fixed absolute constants.

$^5$ See Theorem 12.32 in Chapter 12.


DISTRIBUTION  A:  Distribution  approved  for  public  release. 


Theorem 2.  Let $p \in (0,1)$. If the number of rows in the random sampling matrix $R' \in \mathbb{C}^{m \times d}$
satisfies both

(14)    $\frac{m}{\ln(9m)} \ge C_3 \cdot \frac{s \ln^2(8s) \ln(8d)}{\epsilon^2}$

and

(15)    $m \ge C_4 \cdot \frac{s \log(1/p)}{\epsilon^2}$

then $R'$ will be RIP($2s$,$\epsilon/C_1$) with probability at least $1 - p$.$^6$


We  are  now  prepared  to  present  and  analyze  our  phase  retrieval  method. 


3.  A  Fast  Phase  Retrieval  Algorithm 

The proposed phase retrieval algorithm works in two stages. In the first stage, the vector $y \in \mathbb{C}^D$
from (5) of local entrywise products of $x \in \mathbb{C}^d$ with its conjugate is recovered by inverting the block
circulant matrix $M'$ in (7). Next, a greedy algorithm is used to recover the magnitudes and phases
of each entry of $x$ from $y$ (up to a global phase factor). To see how this works, note that $y$ will
contain all of the products $x_i \overline{x_j}$ for all $i, j \in [d]$ with $|i - j \bmod d| < \delta$. As a result, the magnitude
of each entry $x_j$ can be obtained directly from $x_j \overline{x_j} = |x_j|^2$. Similarly, as long as $x_j \overline{x_j} > 0$, one can
also compute the phase difference $\arg(x_i) - \arg(x_j)$ from $\arg(x_i \overline{x_j})$. Thus, the phase of $x_i$ can be
determined once $\arg(x_j)$ is established. Repeating this process allows one to determine a network
of phase differences which all depend uniquely on the choice of a single entry’s unknown phase.
This entry’s phase becomes the global phase factor $e^{-i\theta}$ from (1). See Algorithm 1 for additional
details.
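The second stage can be sketched in a few lines. The code below is a simplified illustration under stated assumptions: it propagates phases breadth-first from the largest-magnitude entry rather than following the exact reference-update rule of Algorithm 2, it stores the local products in a dense $d \times d$ array for readability, and the names `banded_products` and `recover_from_products` are hypothetical.

```python
import numpy as np

def banded_products(x, delta):
    """G[i, j] = x_i * conj(x_j) for circular distance |i - j| < delta, else 0."""
    d = len(x)
    G = np.zeros((d, d), dtype=complex)
    for i in range(d):
        for t in range(1 - delta, delta):
            j = (i + t) % d
            G[i, j] = x[i] * np.conj(x[j])
    return G

def recover_from_products(G, delta):
    """Recover x up to a global phase from its local products.

    Magnitudes come from the diagonal; phases are propagated outward from
    the largest-magnitude entry using arg(x_i conj(x_j)) = arg(x_i) - arg(x_j).
    """
    d = G.shape[0]
    mags = np.sqrt(np.abs(np.diag(G)))
    phase = np.zeros(d)
    known = np.zeros(d, dtype=bool)
    ref = int(np.argmax(mags))       # reference entry: its phase is set to 0
    known[ref] = True
    frontier = [ref]
    while frontier:
        j = frontier.pop(0)
        for t in range(1 - delta, delta):
            i = (j + t) % d
            if not known[i] and abs(G[i, j]) > 0:
                phase[i] = phase[j] + np.angle(G[i, j])
                known[i] = True
                frontier.append(i)
    return mags * np.exp(1j * phase)

rng = np.random.default_rng(1)
d, delta = 16, 3                     # hypothetical sizes
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
x_hat = recover_from_products(banded_products(x, delta), delta)

# Align the global phase before comparing.
c = np.vdot(x_hat, x)
print(np.allclose(x, (c / abs(c)) * x_hat))
```

If `x` contained a run of $\delta - 1$ or more consecutive zeros, the propagation could stall before reaching every entry; this sketch does not handle that case.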


Algorithm 1  Fast Phase Retrieval

Input: Measurements $|Mx|^2 \in \mathbb{R}^D$ (Recall, e.g., (2) - (4))

Output: $\tilde{x} \in \mathbb{C}^d$ with $\tilde{x} \approx e^{-i\theta} x$ for some $\theta \in [0, 2\pi]$ as per (1)

1: Compute $y = (M')^{-1} P |Mx|^2$ (see (7))

2: Use Algorithm 2 with input $y \in \mathbb{C}^D$ to compute the phase angles, $\phi_j$, of $x_j$ for all $j \in [d]$
3: Set $\tilde{x}_j = \sqrt{x_j \overline{x_j}} \cdot e^{i\phi_j}$ for all $j \in [d]$, where each $x_j \overline{x_j}$ is obtained from $y$


It is important to note that Algorithm 1 assumes that the block circulant matrix $M'$ arising
from our choice of measurements, $M$, is invertible. As we shall see in §4 and §5, this is relatively
easy to achieve. Similarly, Algorithm 2 implicitly assumes that $x$ does not contain any strings of
$\delta - 1$ consecutive zeros (or, more generally, $\delta - 1$ consecutive entries with “very small” magnitudes).
This assumption will also be discussed in §4 and §5, and justified for arbitrary $x$ by modifying
the measurements $M$. For the time being, then, we are left free to consider the computational
complexity of Algorithm 1.

3.1. Runtime Analysis. We will begin our analysis of the runtime complexity of Algorithm 1 by
considering the computation of $y \in \mathbb{C}^D$ in line 1. Recalling §2, we note that the permutation
matrix $P$ is based on a simple row reordering that clusters the first rows of $M_1, \dots, M_{2\delta-1}$ into
a contiguous block, the second rows of $M_1, \dots, M_{2\delta-1}$ into a second contiguous block, etc. (see
(2) and (3)). Thus, $P|Mx|^2$ is simple to compute using only $O(d \cdot \delta)$ operations. To finish
calculating $y = (M')^{-1}P|Mx|^2$ we then use the decomposition of $M'$ from (10) and compute
$y = U_{2\delta-1}J^{-1}U^*_{2\delta-1}P|Mx|^2$.

$^{6}$ Here $C_3, C_4 \in (1, \infty)$ are both fixed absolute constants.

Algorithm 2 Naive Greedy Angular Synchronization

Input: $x_i\overline{x_j}$, $i, j = 0, \dots, d-1$, $|i - j \bmod d| < \delta$.

Output: Relative phase values: $\angle x_i$, $i = 0, \dots, d-1$.

1: Identify largest magnitude entry and set its phase to zero.
   $\angle x_j = 0$, $j = \operatorname{argmax}_i\, x_i\overline{x_i}$, $i = 0, \dots, d-1$.
   Note: We recover the unknown phases up to a global phase factor.

2: Define a binary vector, $\text{phaseFlag} \in \{0, 1\}^d$, to keep track of entries whose phase has already
   been set.
   $\text{phaseFlag}_i = 0$ if $i = j$, and $\text{phaseFlag}_i = 1$ otherwise.

3: while $\sum_{\ell \in (j,\, j+\delta)} \text{phaseFlag}_{\ell \bmod d} > 0$ do

4:   for $i = 1-\delta, 2-\delta, \dots, 0, \dots, \delta-1$ do {Set phase for the $2\delta-1$ entries nearest $x_j$}

5:     if $\text{phaseFlag}_{j+i \bmod d} = 1$ then {Do not over-write previously set phases}

6:       Use the reference phase, $\angle x_j$, and the computed phase differences, $\arg(x_{j+i \bmod d}\,\overline{x_j})$
         and $\arg(x_j\,\overline{x_{j+i \bmod d}})$, to set the phase of entry $x_{j+i \bmod d}$:
         $\angle x_{j+i \bmod d} = \angle x_j + \frac{1}{2}\left(\arg(x_{j+i \bmod d}\,\overline{x_j}) - \arg(x_j\,\overline{x_{j+i \bmod d}})\right)$,
         $\text{phaseFlag}_{j+i \bmod d} = 0$.

7:     end if

8:   end for

9:   Update the reference phase:
     $j = \left(j + \operatorname{argmax}_{0 < i < \delta}\, x_{j+i \bmod d}\,\overline{x_{j+i \bmod d}}\right) \bmod d$

10: end while
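The greedy phase propagation described by Algorithm 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Matlab code: the helper `local_products`, the dictionary layout of the products, and the simplified stopping rule (`while unset.any()`) are assumptions made here for brevity, and the sketch assumes noiseless products with no long runs of zero entries.

```python
import numpy as np

def local_products(x, delta):
    """Hypothetical helper: {(i, j): x_i * conj(x_j)} for circular distance < delta."""
    d = len(x)
    return {(i, (i + s) % d): x[i] * np.conj(x[(i + s) % d])
            for i in range(d) for s in range(1 - delta, delta)}

def greedy_angular_sync(prod, d, delta):
    """Greedy phase propagation in the spirit of Algorithm 2."""
    mags2 = np.array([prod[(i, i)].real for i in range(d)])   # |x_i|^2
    j = int(np.argmax(mags2))            # line 1: largest entry gets phase zero
    phase = np.zeros(d)
    unset = np.ones(d, dtype=bool)       # line 2: plays the role of phaseFlag
    unset[j] = False
    while unset.any():                   # simplified stopping rule (assumption)
        for i in range(1 - delta, delta):
            t = (j + i) % d
            if unset[t]:                 # do not overwrite phases already set
                diff = 0.5 * (np.angle(prod[(t, j)]) - np.angle(prod[(j, t)]))
                phase[t] = phase[j] + diff
                unset[t] = False
        # line 9: new reference = largest entry among the next delta - 1 entries
        j = max(((j + i) % d for i in range(1, delta)), key=lambda t: mags2[t])
    return np.sqrt(mags2), phase
```

Given the returned magnitudes and phases, `mag * np.exp(1j * phase)` reproduces the input vector up to a single global phase factor, which mirrors line 3 of Algorithm 1.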


Recalling the definition of $U_{2\delta-1}$ (9), one can see that both $U_{2\delta-1}$ and $U^*_{2\delta-1}$ have fast matrix-vector
multiplies (i.e., because they can be computed by performing $2\delta-1$ independent fast Fourier
transforms on different sub-vectors of size $d$). Hence, matrix-vector multiplies with both of these
matrices can be accomplished with $O(\delta \cdot d\log d)$ operations. Finally, $J$ is block-diagonal with $d$
blocks of size $(2\delta-1) \times (2\delta-1)$ (see (11)). Thus, $J$ and $J^{-1}$ can both be computed using $O(d \cdot \delta^3)$
total operations. Putting everything together, we can now see that line 1 of Algorithm 1 requires
only $O(d \cdot \delta^3 + \delta \cdot d\log d)$ operations in general. Furthermore, these computations can easily benefit
from parallelism due to the fact that the calculations above are all based on explicitly defined block
decompositions.

The second line of Algorithm 1 calls Algorithm 2 whose runtime complexity is dominated by
its main while-loop (lines 3 through 10). This loop will visit each entry of the input vector $y$ at
most a constant number of times. Hence, it requires $O(\delta \cdot d)$ operations. Finally, the third line
of Algorithm 1 uses only $O(d)$ operations. Thus, the total runtime complexity of Algorithm 1 is
$O(d \cdot \delta^3 + \delta \cdot d\log d)$ in general.
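The operation count above rests on the fact that a block circulant system decouples into many small independent systems after a Fourier transform across the blocks. The following Python sketch illustrates that idea for a generic block circulant matrix. The block layout used here (block $(p, q)$ equals `blocks[(p - q) % d]`) and the function names are assumptions for illustration and need not match the exact ordering induced by $P$ and $U_{2\delta-1}$ in (7) - (11).

```python
import numpy as np

def block_circulant_solve(blocks, b):
    """Solve C y = b, where block (p, q) of C is blocks[(p - q) % d].

    blocks: complex array of shape (d, n, n); b: vector of length d * n.
    """
    d, n, _ = blocks.shape
    B_hat = np.fft.fft(blocks, axis=0)             # d small n x n matrices
    b_hat = np.fft.fft(b.reshape(d, n), axis=0)    # n FFTs of length d
    # d independent n x n solves, O(d * n^3) in total
    y_hat = np.linalg.solve(B_hat, b_hat[..., None])[..., 0]
    return np.fft.ifft(y_hat, axis=0).reshape(-1)

def dense_block_circulant(blocks):
    """Assemble the dense matrix, only for checking small examples."""
    d, n, _ = blocks.shape
    C = np.zeros((d * n, d * n), dtype=complex)
    for p in range(d):
        for q in range(d):
            C[p * n:(p + 1) * n, q * n:(q + 1) * n] = blocks[(p - q) % d]
    return C
```

With $n = 2\delta - 1$, the per-measurement work consists of $n$ length-$d$ FFTs and $d$ small solves. The transform of the blocks themselves does not depend on the measurements, so it could be computed once and reused.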

4.  Error  Analysis  and  Recovery  Guarantees 

In this section we analyze the performance of the proposed phase retrieval method (see Algorithm 1),
and demonstrate measurement matrices which allow it to recover arbitrary vectors, up to
an unknown phase factor, with high probability. Our analysis proceeds in two steps. First, in §4.1
and §4.2, we construct a deterministic set of measurements, $M \in \mathbb{C}^{D \times d}$, which allow Algorithm 1
to recover all relatively flat vectors $x \in \mathbb{C}^d$. Here, "flat" simply means that all entries of $x$ are
bounded away from zero in magnitude. The developed measurements $M$ are Fourier-like, roughly
corresponding to a set of damped and windowed Fourier measurements of overlapping portions of
$x$. In addition to being well conditioned, these Fourier measurements also have fast inverse matrix-vector
multiplies via (an additional usage of) the FFT. Hence, they confer additional computational
advantages beyond those already enjoyed by our general block circulant measurement setup.

Next,  in  §4.3,  we  extend  our  deterministic  recovery  guarantee  for  flat  vectors  to  a  probabilistic 
recovery  guarantee  for  arbitrary  vectors.  This  is  accomplished  by  right-multiplying  M  with  a 
concatenation  of  several  Johnson-Lindenstrauss  embedding  matrices,  each  of  which  tends  to  “flatten 
out”  vectors  they  are  multiplied  against.  In  particular,  we  construct  a  set  of  such  matrices  which 
are  both  (i)  collectively  unitary,  and  (ii)  rapidly  invertible  as  a  group  via  (yet  another  usage  of) 
the  FFT.  The  fact  that  this  flattening  matrix  is  unitary  preserves  the  well  conditioned  nature  of 
our  initial  measurements,  M.  Furthermore,  the  fact  that  the  flattening  matrix  enjoys  a  fast  inverse 
matrix-vector  multiply  via  the  FFT  allows  us  to  maintain  computational  efficiency.  Finally,  the  fact 
that  the  flattening  matrix  produces  a  flattened  version  of  x  with  high  probability  allows  us  to  apply 
our  deterministic  recovery  guarantee  for  flat  vectors  to  vectors  which  are  not  initially  flat.  The  end 
result  of  this  line  of  reasoning  is  the  following  recovery  guarantee  for  noiseless  measurements. 

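Theorem 3. Let $x \in \mathbb{C}^d$ with $d$ sufficiently large. Then, one can select a random measurement
matrix $\tilde{M} \in \mathbb{C}^{D \times d}$ such that the following holds with probability at least
$1 - \frac{1}{C \cdot \ln^2(d) \cdot \ln^3(\ln d)}$:$^{7}$ Algorithm 1 will recover an $\tilde{x} \in \mathbb{C}^d$ with

$$\min_{\theta \in [0, 2\pi]} \left\| x - e^{\mathrm{i}\theta}\tilde{x} \right\|_2 = 0 \tag{16}$$

when given the noiseless magnitude measurements $|\tilde{M}x|^2 \in \mathbb{R}^D$. Here $D$ can be chosen to be
$O(d \cdot \ln^2(d) \cdot \ln^3(\ln d))$. Furthermore, Algorithm 1 will run in $O(d \cdot \ln^3(d) \cdot \ln^3(\ln d))$-time in that
case.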
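The error measure in (16) minimizes over a global phase, and that minimization has a closed form: the optimal factor is the unit-modulus number aligned with the inner product of $\tilde{x}$ and $x$. A minimal Python sketch follows (the function name is an illustrative choice, not taken from the paper's code).

```python
import numpy as np

def phase_aligned_error(x, x_tilde):
    """Return min over theta of || x - exp(i*theta) * x_tilde ||_2."""
    v = np.vdot(x_tilde, x)                 # conj(x_tilde)^T x
    c = v / abs(v) if abs(v) > 0 else 1.0   # optimal unit-modulus factor
    return np.linalg.norm(x - c * x_tilde)
```

Expanding $\|x - c\tilde{x}\|_2^2 = \|x\|_2^2 + \|\tilde{x}\|_2^2 - 2\,\mathrm{Re}(\overline{c}\,\tilde{x}^*x)$ shows that the error is smallest when $c$ has the same argument as $\tilde{x}^*x$, which is what the function computes.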


In  fact,  we  obtain  a  bit  more  than  this  most  basic  noiseless  recovery  result.  For  example,  we 
derive  explicit  bounds  on  the  condition  number  of  the  measurements  M'  proposed  in  §4.1  (as 
opposed  to  simply  proving  them  to  be  invertible).  Continuing  in  this  vein  one  can,  in  fact,  easily 
prove  rather  ugly  (and  not  terribly  enlightening)  worst-case  recovery  guarantees  for  Algorithm  1 
when  it’s  provided  with  noisy  magnitude  measurements  instead  of  noiseless  ones.  However,  we 
will  leave  a  careful  theoretical  analysis  of  the  robustness  of  Algorithm  1  to  measurement  noise  for 
future  work.  For  now,  we  simply  direct  the  concerned  reader  to  §5  after  noting  that  Algorithm  1 
appears  to  be  highly  robust  to  measurement  noise  in  practice.  We  are  now  ready  to  begin  proving 
Theorem  3. 


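4.1. Well Conditioned Measurements. In this section we develop a set of deterministic measurements
$M \in \mathbb{C}^{D \times d}$ that lead to well conditioned block circulant matrices $M' \in \mathbb{C}^{D \times D}$ in (7).
To begin, we choose $a \in [4, \infty)$ and then set

$$(m_\ell)_i = \begin{cases} \dfrac{e^{-i/a}}{\sqrt[4]{2\delta-1}} \cdot e^{\frac{2\pi\mathrm{i}\cdot(i-1)\cdot(\ell-1)}{2\delta-1}} & \text{if } i \le \delta \\ 0 & \text{if } i > \delta \end{cases} \tag{17}$$

for $1 \le \ell \le 2\delta-1$, and $1 \le i \le d$. This leads to blocks $M'_\ell \in \mathbb{C}^{(2\delta-1) \times (2\delta-1)}$ with entries given by

$$(M'_\ell)_{i,j} = \begin{cases}
(m_i)_\ell\,\overline{(m_i)_{j+\ell-1}} = \dfrac{e^{-(2\ell+j-1)/a}}{\sqrt{2\delta-1}} \cdot e^{-\frac{2\pi\mathrm{i}\cdot(i-1)\cdot(j-1)}{2\delta-1}} & \text{if } 1 \le j \le \delta-\ell+1 \\
0 & \text{if } \delta-\ell+2 \le j \le 2\delta-\ell-1 \\
(m_i)_{\ell+1}\,\overline{(m_i)_{\ell+j-2\delta+1}} = \dfrac{e^{-(2\ell+j-2(\delta-1))/a}}{\sqrt{2\delta-1}} \cdot e^{-\frac{2\pi\mathrm{i}\cdot(i-1)\cdot(j-2\delta)}{2\delta-1}} & \text{if } 2\delta-\ell \le j \le 2\delta-1, \text{ and } \ell < \delta \\
0 & \text{if } j > 1, \text{ and } \ell = \delta
\end{cases}$$

$^{7}$ Here $C \in \mathbb{R}^+$ is a fixed absolute constant.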

We  will  now  begin  to  bound  the  condition  number  of  this  block  circulant  matrix,  M' ,  by  block 
diagonalizing  it  via  (10). 

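Considering the entries of each $J_k \in \mathbb{C}^{(2\delta-1) \times (2\delta-1)}$ from (11) results in two cases. First, suppose
that $1 \le j \le \delta$. In this case one can see that

$$(J_k)_{i,j} = \frac{e^{-(j-1)/a}}{\sqrt{2\delta-1}} \cdot e^{-\frac{2\pi\mathrm{i}\cdot(i-1)\cdot(j-1)}{2\delta-1}} \cdot \sum_{\ell=1}^{\delta-j+1} e^{-2\ell/a} \cdot e^{\frac{2\pi\mathrm{i}\cdot k\cdot\ell}{d}} \tag{18}$$

$$= \frac{e^{-(j+1)/a}}{\sqrt{2\delta-1}} \cdot e^{-\frac{2\pi\mathrm{i}\cdot(i-1)\cdot(j-1)}{2\delta-1}} \cdot e^{\frac{2\pi\mathrm{i}\cdot k}{d}} \cdot \frac{1 - e^{-2(\delta-j+1)/a} \cdot e^{\frac{2\pi\mathrm{i}\cdot k\cdot(\delta-j+1)}{d}}}{1 - e^{-2/a} \cdot e^{\frac{2\pi\mathrm{i}\cdot k}{d}}}. \tag{19}$$

Second, suppose that $\delta+1 \le j \le 2\delta-1$. In this case one can see that

$$(J_k)_{i,j} = \frac{e^{-(j-2(\delta-1))/a}}{\sqrt{2\delta-1}} \cdot e^{-\frac{2\pi\mathrm{i}\cdot(i-1)\cdot(j-1)}{2\delta-1}} \cdot \sum_{\ell=2\delta-j}^{\delta-1} e^{-2\ell/a} \cdot e^{\frac{2\pi\mathrm{i}\cdot k\cdot\ell}{d}} \tag{20}$$

$$= \frac{e^{-(2(\delta+1)-j)/a}}{\sqrt{2\delta-1}} \cdot e^{-\frac{2\pi\mathrm{i}\cdot(i-1)\cdot(j-1)}{2\delta-1}} \cdot e^{\frac{2\pi\mathrm{i}\cdot k\cdot(2\delta-j)}{d}} \cdot \frac{1 - e^{-2(j-\delta)/a} \cdot e^{\frac{2\pi\mathrm{i}\cdot k\cdot(j-\delta)}{d}}}{1 - e^{-2/a} \cdot e^{\frac{2\pi\mathrm{i}\cdot k}{d}}}. \tag{21}$$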

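Let $F_\alpha \in \mathbb{C}^{\alpha \times \alpha}$ be the unitary $\alpha \times \alpha$ discrete Fourier transform matrix. Defining

$$s_{k,j} := \begin{cases}
e^{-(j+1)/a} \cdot e^{2\pi\mathrm{i}\cdot k/d} \cdot \dfrac{1 - e^{-2(\delta-j+1)/a} \cdot e^{2\pi\mathrm{i}\cdot k\cdot(\delta-j+1)/d}}{1 - e^{-2/a} \cdot e^{2\pi\mathrm{i}\cdot k/d}} & \text{if } 1 \le j \le \delta \\
e^{-(2(\delta+1)-j)/a} \cdot e^{2\pi\mathrm{i}\cdot k\cdot(2\delta-j)/d} \cdot \dfrac{1 - e^{-2(j-\delta)/a} \cdot e^{2\pi\mathrm{i}\cdot k\cdot(j-\delta)/d}}{1 - e^{-2/a} \cdot e^{2\pi\mathrm{i}\cdot k/d}} & \text{if } \delta+1 \le j \le 2\delta-1
\end{cases}$$

we now have that

$$J_k = F_{2\delta-1} \begin{pmatrix} s_{k,1} & 0 & \cdots & 0 \\ 0 & s_{k,2} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & s_{k,2\delta-1} \end{pmatrix}. \tag{22}$$

Note that the condition number of $J$, and therefore of $M'$, will be dictated by the singular values
of these matrices. Thus, we will continue by developing bounds for the singular values of each
$J_k \in \mathbb{C}^{(2\delta-1) \times (2\delta-1)}$.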

The fact that $F_{2\delta-1}$ is unitary implies that

$$\min_{j \in [2\delta-1]} |s_{k,j}| \le \sigma_{2\delta-1}(J_k) \le \sigma_1(J_k) \le \max_{j \in [2\delta-1]} |s_{k,j}| \tag{23}$$

for all $k \in [d]$. Thus, we will now devote ourselves to bounding the maximum and minimum values
of $|s_{k,j}|$ from above and below, respectively, over all $k \in [d]$ and $j \in [2\delta-1]$. These bounds will
then collectively yield an upper bound on the condition number of our block circulant measurement
matrix $M'$. The following simple technical lemmas will be useful.

Lemma 1. Let $x \in [2, \infty)$. Then, $1 - e^{-1/x} \ge \frac{2 - e^{1/x}}{x} \ge \frac{2 - \sqrt{e}}{x} > \frac{7}{20x}$.

Proof: Note that $1 - e^{-1/x} = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n!\,x^n} \ge \frac{1}{x}\left(2 - \sum_{n=0}^{\infty} \frac{1}{x^n (n+1)!}\right) \ge \frac{2 - e^{1/x}}{x}$. Furthermore,
the numerator is a monotonically increasing function of $x$. □

Lemma 2. Let $a, b, c \in \mathbb{R}^+$, and $f : \mathbb{R} \to \mathbb{R}$ below. Then,

(1) $f(x) = b \cdot e^{-x/a}\left(1 + c \cdot e^{2x/a}\right)$ has a unique global minimum at $x = -\frac{a}{2}\ln(c)$, and

(2) $f(x) = b \cdot e^{-x/a}\left(1 - c \cdot e^{2x/a}\right)$ is monotonically decreasing.

Proof: In either case we have that $f'(x) = -\frac{b}{a} \cdot e^{-x/a} \pm \frac{bc}{a} \cdot e^{x/a}$, and $f''(x) = \frac{b}{a^2} \cdot e^{-x/a} \pm \frac{bc}{a^2} \cdot e^{x/a}$.
For (1) we have a single critical point at $x = -\frac{a}{2}\ln(c)$, which is a global minimum since $f''(x) > 0$
$\forall x \in \mathbb{R}$. For (2) we have $f'(x) < 0$ for all $x \in \mathbb{R}$. □

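Note that

$$|s_{k,j}| = \begin{cases}
e^{-(j+1)/a} \cdot \sqrt{\dfrac{1 + e^{-4(\delta-j+1)/a} - 2e^{-2(\delta-j+1)/a}\cos(2\pi\cdot[\delta-j+1]\cdot k/d)}{1 + e^{-4/a} - 2e^{-2/a}\cos(2\pi k/d)}} & \text{if } 1 \le j \le \delta \\
e^{-(2(\delta+1)-j)/a} \cdot \sqrt{\dfrac{1 + e^{-4(j-\delta)/a} - 2e^{-2(j-\delta)/a}\cos(2\pi\cdot[j-\delta]\cdot k/d)}{1 + e^{-4/a} - 2e^{-2/a}\cos(2\pi k/d)}} & \text{if } \delta+1 \le j \le 2\delta-1
\end{cases} \tag{24}$$

Fix $k \in [d]$. When $1 \le j \le \delta$ we have

$$\max_{j \in [\delta]} |s_{k,j}| \le \max_{j \in [\delta]} \left( e^{-(j+1)/a} \cdot \frac{1 + e^{-2(\delta-j+1)/a}}{1 - e^{-2/a}} \right) \le \frac{e^{-2/a}\left(1 + e^{-2\delta/a}\right)}{1 - e^{-2/a}} \tag{25}$$

where the second inequality follows from part one of Lemma 2. When $\delta+1 \le j \le 2\delta-1$ we have

$$\max_{j \in [2\delta-1]\setminus[\delta]} |s_{k,j}| \le \max_{j \in [2\delta-1]\setminus[\delta]} \left( e^{-(2(\delta+1)-j)/a} \cdot \frac{1 + e^{-2(j-\delta)/a}}{1 - e^{-2/a}} \right) \le \frac{e^{-3/a}\left(1 + e^{-2(\delta-1)/a}\right)}{1 - e^{-2/a}} \tag{26}$$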

where the second inequality again follows from part one of Lemma 2. Finally, combining (25) and
(26) one can see that

$$\sigma_1(J_k) \le \frac{e^{-2/a}\left(1 + e^{-2\delta/a}\right)}{1 - e^{-2/a}} \le a \cdot \frac{e^{-2/a}\left(1 + e^{-2\delta/a}\right)}{2\left(2 - e^{2/a}\right)} \le a \cdot \frac{20e^{-2/a}}{7} < 3a \cdot e^{-2/a}, \tag{27}$$

where the second inequality follows from Lemma 1 with $a \in [4, \infty)$.

Turning our attention to the lower bound, we note that part two of Lemma 2 implies that

$$\min_{j \in [\delta]} |s_{k,j}| \ge \min_{j \in [\delta]} \left( e^{-(j+1)/a} \cdot \frac{1 - e^{-2(\delta-j+1)/a}}{1 + e^{-2/a}} \right) \ge \frac{e^{-(\delta+1)/a}\left(1 - e^{-2/a}\right)}{1 + e^{-2/a}}. \tag{28}$$

Similarly, part two of Lemma 2 also ensures that

$$\min_{j \in [2\delta-1]\setminus[\delta]} |s_{k,j}| \ge \min_{j \in [2\delta-1]\setminus[\delta]} \left( e^{-(2(\delta+1)-j)/a} \cdot \frac{1 - e^{-2(j-\delta)/a}}{1 + e^{-2/a}} \right) \ge \frac{e^{-(\delta+1)/a}\left(1 - e^{-2/a}\right)}{1 + e^{-2/a}}. \tag{29}$$

Combining (28) and (29) we see that

$$\sigma_{2\delta-1}(J_k) \ge \frac{e^{-(\delta+1)/a}\left(1 - e^{-2/a}\right)}{1 + e^{-2/a}} \ge \frac{7}{20a} \cdot e^{-(\delta+1)/a}, \tag{30}$$

where the second inequality follows from Lemma 1 with $a \in [4, \infty)$. We are now equipped to prove
the main theorem of this section.


Theorem 4. Define $M' \in \mathbb{C}^{D \times D}$ via (17) with $a := \max\left\{4, \frac{\delta-1}{2}\right\}$. Then,

$$\kappa(M') < \max\left\{144e^2, \frac{9e^2}{4} \cdot (\delta-1)^2\right\}.$$

Proof: We have from (27) and (30) that

$$\kappa(M') = \frac{\sigma_1(M')}{\sigma_D(M')} = \frac{\sigma_1(J)}{\sigma_D(J)} \le \frac{\max_{k \in [d]} \sigma_1(J_k)}{\min_{k \in [d]} \sigma_{2\delta-1}(J_k)} < 9a^2 \cdot e^{(\delta-1)/a}. \tag{31}$$

Minimizing the rightmost upper bound as a function of $a$ yields the stated result. □

Theorem 4 guarantees the existence of measurements which allow for the robust recovery of the
phase difference vector $y \in \mathbb{C}^D$ defined in (5). In the next three subsections we analyze the recovery
of $x \in \mathbb{C}^d$ from $y$ via the techniques discussed in §3.
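As a numerical sanity check of the bound in Theorem 4, one can evaluate $|s_{k,j}|$ directly and compare the ratio of the largest to the smallest value against $\max\{144e^2, \frac{9e^2}{4}(\delta-1)^2\}$. The Python sketch below assumes the closed forms for $s_{k,j}$ as reconstructed from (19) and (21), indexes $k$ from $0$ to $d-1$, and uses illustrative function names; it is not the authors' code.

```python
import numpy as np

def s_abs(d, delta, a):
    """|s_{k,j}| for k = 0..d-1 (rows) and j = 1..2*delta-1 (columns)."""
    k = np.arange(d)[:, None]
    j = np.arange(1, 2 * delta)[None, :]
    r = np.exp(-2.0 / a) * np.exp(2j * np.pi * k / d)         # common ratio
    n_terms = np.where(j <= delta, delta - j + 1, j - delta)  # terms in the sum
    decay = np.where(j <= delta,
                     np.exp(-(j + 1) / a),
                     np.exp(-(2 * (delta + 1) - j) / a))
    # unit-modulus phase factors are dropped since only magnitudes matter
    return decay * np.abs(1 - r ** n_terms) / np.abs(1 - r)

def kappa_and_bound(d, delta):
    """Condition number max|s| / min|s| and the Theorem 4 bound."""
    a = max(4.0, (delta - 1) / 2)
    s = s_abs(d, delta, a)
    bound = max(144 * np.e ** 2, 2.25 * np.e ** 2 * (delta - 1) ** 2)
    return s.max() / s.min(), bound
```

Because each $J_k$ is a unitary matrix times a diagonal matrix, its singular values are exactly the $|s_{k,j}|$, so the returned ratio is the condition number of $J$ under these assumptions.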


4.2. A Recovery Guarantee for Flat Vectors. As mentioned in §3, Algorithm 1 implicitly
assumes that $x \in \mathbb{C}^d$ does not contain any strings of $\delta - 1$ consecutive entries with very small
magnitudes (mod $d$). We will refer to vectors without such strings as being "flat". More specifically,
we will utilize the following more concrete definition.

Definition 3. Let $m \in [d]$. A vector $u \in \mathbb{C}^d$ will be called $m$-flat if its entries can be partitioned
into at least $\lfloor d/m \rfloor$ contiguous blocks such that:

(1) Every block contains either $m$ or $m+1$ entries,

(2) Every block contains at least one entry whose magnitude is $\ge \frac{\|u\|_2}{2\sqrt{d}}$, and

(3) All entries of $u$ have magnitude $\le \sqrt{\frac{3m}{2d}} \cdot \|u\|_2$.

Note that Algorithm 1 will always successfully recover $\frac{\delta-1}{2}$-flat vectors whenever $(M')^{-1}$
exists. To see why, it suffices to consider the main while-loop of Algorithm 2 (i.e., lines 3 through
10). In particular, line 6 will always succeed in computing the correct (relative) phase of the entry
in question as long as $|x_j| > 0$. Furthermore, such a $j$ will always be discovered in line 9 if $x$ is
$\frac{\delta-1}{2}$-flat. This observation leads us to the following theorem.


Theorem 5. Let $M \in \mathbb{C}^{D \times d}$ be defined as in §4.1, and suppose that $x \in \mathbb{C}^d$ is $m$-flat for some
$m \le \frac{\delta-1}{2}$. Then, Algorithm 1 will recover an $\tilde{x} \in \mathbb{C}^d$ with

$$\min_{\theta \in [0, 2\pi]} \left\| x - e^{\mathrm{i}\theta}\tilde{x} \right\|_2 = 0 \tag{32}$$

when given the noiseless input measurements $|Mx|^2 \in \mathbb{R}^D$. Furthermore, Algorithm 1 requires just
$O(\delta \cdot d\log d)$ operations in this case.

Proof: The recovery guarantee (32) follows from Theorem 4 together with the preceding paragraph.
The runtime complexity of Algorithm 1 simplifies to $O(\delta \cdot d\log d)$ operations when using the measurements
defined in §4.1 because the matrix $J$ ends up having a simple factorization (see (22)). □


Of course, not all vectors are flat. We remedy this defect in the next subsection.

4.3. Flattening Arbitrary Vectors with High Probability. Let $W \in \mathbb{C}^{d \times d}$ be the random
unitary matrix

$$W := PFB, \tag{33}$$

where $P \in \{0, 1\}^{d \times d}$ is a permutation matrix selected uniformly at random from the set of all
$d \times d$ permutation matrices, $F$ is the unitary $d \times d$ discrete Fourier transform matrix, and
$B \in \{-1, 0, 1\}^{d \times d}$ is a random diagonal matrix with i.i.d. symmetric Bernoulli entries on its diagonal.
For any given $m \in [d]$, one can naturally partition $W$ into blocks of contiguous rows, each
of cardinality either $m$ or $m+1$. This defines the $\lfloor d/m \rfloor$ sub-matrices of $W$,
$W_1, \dots, W_{d - m\lfloor d/m \rfloor} \in \mathbb{C}^{(m+1) \times d}$ and $W_{d - m\lfloor d/m \rfloor + 1}, \dots, W_{\lfloor d/m \rfloor} \in \mathbb{C}^{m \times d}$, by

$$W = \begin{pmatrix} W_1 \\ W_2 \\ \vdots \\ W_{\lfloor d/m \rfloor} \end{pmatrix}. \tag{34}$$

Note that each renormalized sub-matrix of $W$, $\sqrt{\frac{d}{m}} \cdot W_j$ for $j \in [\lfloor d/m \rfloor]$, is "almost" a random
sampling matrix (13) times a random diagonal Bernoulli matrix. As a result, Theorems 1 and 2
suggest that each $\sqrt{\frac{d}{m}} \cdot W_j$ should behave like a JL($m$,$d$,$\epsilon$)-embedding of our signal $x$ into $\mathbb{C}^m$ (or
$\mathbb{C}^{m+1}$). If true, it would then be reasonable to expect that each block of $m$ consecutive entries of
$Wx$ should have roughly the same $\ell_2$-norm as one another. This, in turn, suggests that the random
unitary matrix $W$ should effectively flatten $x$ with high probability, especially when $m$ is small.
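The flattening effect of $W = PFB$ is easy to observe numerically. The Python sketch below applies $W$ and its adjoint without ever forming a $d \times d$ matrix; the function name `make_flattener` and the use of $\pm 1$ signs for the diagonal of $B$ are assumptions made for this illustration.

```python
import numpy as np

def make_flattener(d, rng):
    """Return callables applying W = P F B and its adjoint W*."""
    perm = rng.permutation(d)                  # random permutation P
    signs = rng.choice([-1.0, 1.0], size=d)    # diagonal of B (symmetric Bernoulli)

    def W(x):
        # B scales the entries, F is the unitary DFT, P reorders the rows
        return np.fft.fft(signs * x, norm="ortho")[perm]

    def W_adj(y):
        z = np.empty(d, dtype=complex)
        z[perm] = y                            # undo the permutation
        return signs * np.fft.ifft(z, norm="ortho")

    return W, W_adj
```

For example, a vector with a single nonzero entry (the least flat input possible) is mapped to a vector whose entries all have magnitude $1/\sqrt{d}$, and `W_adj(W(x))` returns `x`, each at the cost of one FFT.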

Of course, there are several small difficulties that must be addressed before the argument above
can be made rigorous. First, the rows of $F$ contributing to $\sqrt{\frac{d}{m}} \cdot W_j$ are effectively independently
sampled uniformly without replacement from the set of all rows of $F$ by our choice of $P$. This means
that Theorem 2 does not strictly apply in our situation since we cannot select any row of $F$ more
than once. Secondly, some care must be taken in order to select the smallest value of $m$ possible
in (34), since $Wx$ will "become flatter" as $m$ decreases. As a result, $m$ will effectively provide a
theoretical lower bound on the size of $\delta$ that one can utilize and still be guaranteed to accurately
recover $Wx$ via our §3 techniques (recall also §4.2 above). We are now ready to begin proving our
main result concerning $W$.

The  following  simple  lemma  will  be  used  in  order  to  help  adapt  Theorem  2  to  the  situation  where 
the  rows  of  F  are  sampled  uniformly  without  replacement. 


Lemma 3. Let $m \in \mathbb{N}$ with $m \le \sqrt{d}$. Independently draw $x_1, \dots, x_m$ from $[d]$ uniformly at random
with replacement. Then, $\mathbb{P}\left[|\{x_1, \dots, x_m\}| = m\right] \ge 1/2$.

Proof: A short induction argument establishes that

$$\mathbb{P}\left[|\{x_1, \dots, x_m\}| = m\right] = \prod_{j=1}^{m-1}\left(1 - \frac{j}{d}\right) \ge 1 - \sum_{j=1}^{m-1}\frac{j}{d} = 1 - \frac{m^2 - m}{2d}. \tag{35}$$

The result now follows easily via algebraic manipulation. □
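The chain of (in)equalities in (35) can be checked exactly for small cases. The sketch below uses exact rational arithmetic; the function names are illustrative only.

```python
from fractions import Fraction
from math import isqrt

def prob_all_distinct(m, d):
    """Exact probability that m uniform draws from [d] (with replacement) are distinct."""
    p = Fraction(1)
    for j in range(1, m):
        p *= Fraction(d - j, d)      # the j-th new draw must avoid j earlier values
    return p

def lower_bound(m, d):
    """Right-hand side of (35): 1 - (m^2 - m) / (2d)."""
    return 1 - Fraction(m * m - m, 2 * d)
```

When $m \le \sqrt{d}$ we have $m^2 - m \le d - m$, so the lower bound is at least $\frac{1}{2} + \frac{m}{2d}$, which is the algebraic manipulation the proof refers to.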


The following corollary of Theorem 2 now demonstrates that a random sampling matrix $R'$
formed by sampling a subset of rows of size $m$ uniformly at random from $F$ will still be RIP($2s$,$\epsilon/C_1$)
with high probability.



Corollary 1. Let $p \in (0, 1)$. Form a random sampling matrix $R' \in \mathbb{C}^{m \times d}$ by independently
sampling $m$ rows from $F$ uniformly without replacement. If the number of rows, $m$, satisfies both

$$\sqrt{d} \ge m \ge C_3 \cdot \frac{s\ln^2(8s)\ln(8d)\ln(9m)}{\epsilon^2} \tag{36}$$

and

$$\sqrt{d} \ge m \ge C_4 \cdot \frac{s\log(2/p)}{\epsilon^2} \tag{37}$$

then $R'$ will be RIP($2s$,$\epsilon/C_1$) with probability at least $1 - p$.

Proof: Let $S := \{x_1, \dots, x_m\}$, where each $x_j \in [d]$ is selected independently and uniformly at
random from $[d]$ (with replacement). Similarly, let $S' \subset [d]$ be a subset of $[d]$ chosen uniformly
at random from all subsets of $[d]$ with cardinality $m$ (i.e., let $S'$ contain $m$ elements sampled
independently and uniformly from $[d]$ without replacement). Furthermore, let $E$ denote the event
that the random sampling matrix whose rows from $F$ are $x_1, \dots, x_m$ is not RIP($2s$,$\epsilon/C_1$). Finally,
let $E'$ denote the event that the random sampling matrix whose rows from $F$ are the elements of
$S'$ is not RIP($2s$,$\epsilon/C_1$). Applying Lemma 3 we can now see that

$$\mathbb{P}[E] \ge \mathbb{P}\left[E \mid |S| = m\right] \cdot \mathbb{P}\left[|S| = m\right] = \mathbb{P}[E'] \cdot \mathbb{P}\left[|S| = m\right] \ge \frac{1}{2} \cdot \mathbb{P}[E']. \tag{38}$$

The stated result now follows from Theorem 2. □


We are now ready to prove that $W$ will flatten the signal $x \in \mathbb{C}^d$ with high probability provided
that $m$ can be chosen appropriately. We have the following theorem:

Theorem  6.  Let  W  £  <Ddxrf  be  formed  as  per  (33)  for  d  >  8.  Then,  Wx  £  will  be  m-flat  with 

probability  at  least  1  —  B  provided  that  \fd  >  m  +  1  >  C5  •  ln2(d)  •  In3  (lnd).8 

Proof:  Our  first  goal  will  be  to  show  that  each  W\, . . . ,  W\  d_  1  from  (34)  is  a  is  a  rescaled  JL(m,d,l/2)- 

L  m  J 

embedding  of  {x}  into  Cm  (or  Cm+1 ).  This  will  guarantee  that  each  consecutive  block  of  m ,  (or 
m  +  1)  entries  of  Wx  has  roughly  the  same  ^2-norm. 

To  achieve  this  goal  we  will  apply  Theorem  1  to  each  J  ^  •  W\ , . . . ,  \  ^  •  Wi  d_  1  in  order  to  show 

that  each  one  embeds  {x}  into  Cm  (or  Cm+1)  with  probability  at  least  1  —  ^7.  The  union  bound 

will  then  imply  that  {x}  is  embedded  by  all  the  ■  Wj  with  probability  at  least  1  —  This 

argument  will  go  through  as  long  as  each  y ^  •  W\B~l , . . . ,  y/ ^  is  RIP(2s,l/2C'i)  for 

some  s>C2-  ln(8d).  Hence,  we  will  now  focus  on  determining  the  range  of  m  which  guarantees 
that  all  of  these  matrices  are  RIP(|"2C,2  •  ln(8d)~|  ,l/2Ci). 

To  demonstrate  that  each  y ^  ■  TFjH”1  is  RIP(|"2C,2  •  ln(8d)]  ,l/2Ci)  with  probability  at  least 
1  —  P  one  may  apply  Corollary  1  with  m  (or  m.  +  1)  chosen  as  above  (assuming  d  >  8).  Another 
application  of  the  union  bound  then  establishes  that  all  of  y  ^  •  \V\B~1 , . . . ,  \  ^  •  Wi  d_  1  B^1  will  be 
RIP(  \2C2  -ln(8d)]  ,\f2C\)  with  probability  at  least  1  —  2777 •  One  final  application  of  the  union  bound 
then  establishes  our  first  goal:  All  of  W\, . . . ,  -  W^d_ j  will  be  JL(/n,d,l/2)-embeddings  of 

{x}  with  probability  at  least  1  —  B. 


$^{8}$ Here $C_5 \in \mathbb{R}^+$ is a fixed absolute constant.

To finish the proof, we now note that $Wx$ will be $m$-flat whenever all of the $\sqrt{\frac{d}{m}} \cdot W_j$
matrices are JL($m$,$d$,$1/2$)-embeddings of $\{x\}$. To see why, suppose that

$$\frac{1}{2}\|x\|_2^2 \le \left\|\sqrt{\tfrac{d}{m}}\,W_jx\right\|_2^2 \le \frac{3}{2}\|x\|_2^2.$$

This implies that $\frac{3m}{2d}\|x\|_2^2 \ge \|W_jx\|_2^2 \ge \frac{m}{2d}\|x\|_2^2$, which can only happen if both of the following
hold: (i) at least one entry of $W_jx$ has magnitude at least $\frac{\|x\|_2}{2\sqrt{d}}$, and (ii) all entries of $W_jx$
have magnitude less than $\sqrt{\frac{3m}{2d}}\,\|x\|_2 = \sqrt{\frac{3m}{2d}}\,\|Wx\|_2$. This proves the theorem. □


Theorem 6 now allows us to alter our measurements so that we can recover arbitrary vectors.
We are now ready to prove Theorem 3.

4.4. Proof of Theorem 3. We set our measurement matrix $\tilde{M} \in \mathbb{C}^{D \times d}$ to be $\tilde{M} := MW$ where
$M \in \mathbb{C}^{D \times d}$ is defined as in §4.1, and $W \in \mathbb{C}^{d \times d}$ is as defined in (33). Theorem 6 guarantees that
$Wx$ will be $m = O(\ln^2(d) \cdot \ln^3(\ln d))$-flat with probability at least $1 - \frac{1}{C_5 \cdot \ln^2(d) \cdot \ln^3(\ln d)}$ provided that
$d$ is sufficiently large. Furthermore, if $Wx$ is $m$-flat and $\delta \ge 2m+1$, then Theorem 5 guarantees
that Algorithm 1 will recover an $x' \in \mathbb{C}^d$ satisfying

$$\min_{\theta \in [0, 2\pi]} \left\| Wx - e^{\mathrm{i}\theta}x' \right\|_2 = 0 \tag{39}$$

when given the noiseless input measurements $|MWx|^2 \in \mathbb{R}^D$. Hence, choosing $\delta = O(\ln^2(d) \cdot \ln^3(\ln d))$
allows us to recover $x' = W\left(e^{\mathrm{i}\phi}x\right)$, for some unknown phase $\phi \in [0, 2\pi]$, with probability
at least $1 - \frac{1}{C_5 \cdot \ln^2(d) \cdot \ln^3(\ln d)}$.$^{9}$ We then set $\tilde{x} = W^*x'$.

Considering the runtime complexity, we note that $x'$ can be obtained in $O(\delta \cdot d\log d) = O(d \cdot \ln^3(d) \cdot \ln^3(\ln d))$
operations by Theorem 5. Computing $W^*x'$ can then be done in $O(d\log d)$ operations
via an inverse fast Fourier transform. The stated runtime complexity follows.


It  is  interesting  to  note  that  alternate  constructions  of  flattening  matrices,  W,  with  fast  inverse- 
matrix  vector  multiplies  can  also  be  created  by  using  sparse  Johnson-Lindenstrauss  embedding 
matrices  in  the  place  of  our  Fourier-based  matrices  (see,  e.g.,  [7]).  Thus,  one  has  several  choices  of 
matrices  W  to  use  in  concert  with  a  given  block-circulant  measurement  matrix  M  in  principle. 

5.  Empirical  Evaluation 

We now present numerical results demonstrating the efficiency and robustness of the phase
retrieval method of Algorithm 1. We test our algorithm on unit-norm i.i.d. zero-mean complex random Gaussian
test signals. To test noise robustness, we add i.i.d. random Gaussian noise to the squared magnitude
measurements at desired signal to noise ratios (SNRs); i.e.,

$$y = |Mx|^2 + n, \tag{40}$$

where $y \in \mathbb{R}^D$ denotes the noisy measurement vector and the noise $n \in \mathbb{R}^D$ is chosen to be i.i.d.
$\mathcal{N}(0, \sigma^2 I_D)$. The variance $\sigma^2$ is chosen such that

SNR  (dB)  =  101og10(^|2). 

Errors  in  the  recovered  signal  are  also  reported  in  dB  with 

Error  (dB)  =  101og10 

where $\tilde{x}$ denotes the recovered signal. Matlab code used to generate the numerical results is freely
available at [25].

$^{9}$ The probability estimate in Theorem 3 follows immediately with $C = C_5$.

We  start  by  presenting  numerical  simulations  demonstrating  the  efficiency  of  the  block  circulant 
construction  introduced  in  this  paper.  In  particular,  we  plot  the  execution  time  for  solving  the 
phase  retrieval  problem  (averaged  over  100  trials)  in  Figure  1.  Simulations  were  performed  on 
a  laptop  computer  with  an  Intel®  Core™i3-3120M  processor,  4GB  RAM  and  Matlab  R2014b. 
For  comparison,  we  also  plot  execution  times  for  the  Gerchberg-Saxton  [20,  35]  alternating  pro¬ 
jection  and  PhaseLift  algorithms.10  In  each  case,  we  recover  a  random  complex  Gaussian  signal 
from  noiseless  magnitude  measurements.  We  consider  two  cases:  (i)  Figure  la,  which  plots  the 
execution  time  for  solving  the  phase  retrieval  problem  using  5 D  measurements  (suitable  for  high 
SNR  applications),  and  (ii)  Figure  lb,  which  plots  the  execution  time  when  4dlogd  block  circulant 
measurements  are  used  (suitable  for  generic  applications  at  a  wide  range  of  SNRs).  Both  plots 
confirm  the  log-linear  execution  time  for  implementing  Algorithm  1.  Moreover,  it  is  clear  that 
the  block  circulant  construction  introduced  here  is  several  orders  of  magnitude  faster  than  equiv¬ 
alent  methods,  thereby  allowing  us  to  solve  high-dimensional  problems  previously  thought  to  be 
computationally  infeasible. 


Figure 1. Computational Efficiency of the Block-Circulant Phase Retrieval Algorithm.
(a) Execution time - Phase Retrieval from 5D measurements.
(b) Execution time - Phase Retrieval from $4d\log d$ measurements.

We  next  demonstrate  robustness  to  additive  noise.  Figure  2a  plots  the  reconstruction  error  in 
recovering  a  d  =  64  complex  random  Gaussian  signal  at  different  SNRs,  with  each  data  point 
computed  as  the  average  of  100  trials.11  We  include  reconstruction  results  using  the  Gerchberg- 
Saxton  alternating  projection  and  PhaseLift  algorithms  for  comparison.  The  deterministic  win¬ 
dowed  Fourier- like  measurements  introduced  in  §4.1  were  used  for  the  block  circulant  construction, 
while  complex  random  Gaussian  measurements  were  used  for  the  other  methods.  We  observe 
that  all  methods  recover  the  underlying  signal  to  the  level  of  noise,  although  the  block  circulant 
construction  requires  approximately  twice  the  number  of  measurements  as  the  other  methods. 

$^{10}$ Simulation results using PhaseLift and the Gerchberg-Saxton alternating projection algorithm use random
complex Gaussian measurements.

$^{11}$ A few iterations of the alternating projection algorithm were used to post-process the reconstructions.

For completeness, we also plot the reconstruction error for a larger problem ($d = 2048$) in
Figure 2b for three different numbers of measurements ($D$), using the deterministic measurement
construction. We note that the dimensions of this problem would make it computationally
intractable (on a conventional laptop or desktop machine, implemented in Matlab) for methods
such as Gerchberg-Saxton or PhaseLift.


Figure 2. Robustness to Additive Noise of the Block-Circulant Phase Retrieval Algorithm.
(a) Robustness to Additive Noise ($d = 64$).
(b) Robustness to Additive Noise ($d = 2048$).
(c) Recovery using Random Masks ($d = 2048$).
Horizontal axis in the panels: Noise Level in SNR (dB).

To illustrate the flexibility of the measurement construction introduced in this paper, we also
include results using random masks in Figure 2c. In particular, the entries of the block circulant
measurement matrix are chosen to be i.i.d. standard complex Gaussian. Moreover, we may fix the
block length $\delta$ and collect oversampled measurements to improve the noise robustness of the recovery
algorithm. In Figure 2c, the block length was fixed to be $\delta$ = [2d log d], and oversampled measurements
(by factors of 1.5, 2 and 3) were used to recover the $d = 2048$ length i.i.d. complex Gaussian
test signal. The figure confirms that the random block-circulant construction also demonstrates
robustness to additive noise across a wide range of SNRs, while the reconstruction accuracy improves
with the oversampling factor.

Finally, Figure 3 plots the condition number of the system matrix used to solve for the phase
differences (matrix $M'$ in §2) for the deterministic block circulant measurement construction
introduced in §4.1. The figure plots the condition number as a function of the block length $\delta$ for
$d = 64$.$^{12}$ It confirms that the condition number scales as a small multiple of $\delta^2$. The figure also
includes a plot of the condition number when using random masks at an oversampling factor of
1.5.


Figure 3. Well Conditioned Measurements - Condition Number as a Function of
the Block Length $\delta$


6.  Sublinear-Time  Phase  Retrieval  for  Compressible  Signals 

In this section we briefly focus on the compressive phase retrieval setting (see, e.g., [34, 36, 28,
41, 16, 37]), where one aims to approximate a sparse or compressible $x \in \mathbb{C}^d$ using fewer magnitude
measurements than required for the recovery of general $x$. It is known that robust compressive
phase retrieval for $s$-sparse vectors is possible using only $O(s \log(d/s))$ magnitude measurements
[16, 24]. In this section we prove that it is also possible to recover $s$-sparse vectors $x \in \mathbb{C}^d$ up to an
unknown phase factor in only $O(s \log^6 d)$-time using $O(s \log^5 d)$ magnitude measurements. Thus,
we establish the first known nearly runtime-optimal (i.e., essentially linear-time in $s$) compressive
phase retrieval recovery result. In particular, we prove the following theorem.

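Theorem 7. There exists a deterministic algorithm $\mathcal{A} : \mathbb{R}^D \to \mathbb{C}^d$ for which the following holds:
Let $\epsilon \in (0,1]$, $x \in \mathbb{C}^d$ with $d$ sufficiently large, and $s \in [d]$. Then, one can select a random
measurement matrix $M \in \mathbb{C}^{D \times d}$ such that

\[ \min_{\theta \in [0,2\pi]} \left\| e^{i\theta} x - \mathcal{A}(|Mx|) \right\|_2 \le \left\| x - x_s^{\rm opt} \right\|_2 + \frac{22\epsilon \left\| x - x_{(s/\epsilon)}^{\rm opt} \right\|_1}{\sqrt{s}} \tag{41} \]

is true with probability at least $1 - (\ldots)$ [the remainder of this expression is illegible in the source] (footnote 13). Here $D$ can be chosen to be $O\left(\frac{s}{\epsilon} \cdot \ln^3\left(\frac{s}{\epsilon}\right) \cdot \ln^3\left(\ln \frac{s}{\epsilon}\right) \cdot \ln d\right)$. Furthermore, the algorithm will run in $O\left(\frac{s}{\epsilon} \cdot \ln^4\left(\frac{s}{\epsilon}\right) \cdot \ln^3\left(\ln \frac{s}{\epsilon}\right) \cdot \ln d\right)$-time in that case (footnote 14).

Footnote 12: The condition number is independent of the problem dimension $d$ and depends only on the block length $\delta$.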

We prove Theorem 7 by following the generic compressive phase retrieval recipe presented in [24].
Let $C \in \mathbb{C}^{m \times d}$ be any compressive sensing matrix with an associated sparse approximation
algorithm $\mathcal{A} : \mathbb{C}^m \to \mathbb{C}^d$ (see, e.g., [8, 10, 39, 31, 5, 32, 33]), and let $P \in \mathbb{C}^{D \times m}$ be any phase retrieval
matrix with an associated recovery algorithm $\Phi : \mathbb{R}^D \to \mathbb{C}^m$. Then, $\mathcal{A} \circ \Phi : \mathbb{R}^D \to \mathbb{C}^d$ will
approximately recover compressible vectors $x \in \mathbb{C}^d$ up to an unknown phase factor when provided with
the magnitude measurements $|PCx|$. That is, one may first use $\Phi$ to recover $e^{i\phi}(Cx) = C(e^{i\phi}x)$
for some unknown $\phi \in [0, 2\pi]$ from $|PCx|$, and then use $\mathcal{A}$ to recover $e^{i\phi}x$ from $C(e^{i\phi}x)$. If both
$\Phi$ and $\mathcal{A}$ are efficient, the result will be an efficient sparse phase retrieval method.

Herein we will utilize Algorithm 1 as our phase retrieval method. Note that its runtime is only
$O(d \log^4 d)$, making it optimal up to log factors (recall Theorem 3). For the compressive sensing
method we will utilize the following algorithmic result from [23].

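Theorem 8. Let $\epsilon \in (0,1]$, $\sigma \in [2/3,1)$, $x \in \mathbb{C}^d$, and $s \in [d]$. With probability at least $\sigma$ the
deterministic compressive sensing algorithm from [23] will output a vector $z \in \mathbb{C}^d$ satisfying

\[ \left\| x - z \right\|_2 \le \left\| x - x_s^{\rm opt} \right\|_2 + \frac{22\epsilon \left\| x - x_{(s/\epsilon)}^{\rm opt} \right\|_1}{\sqrt{s}} \tag{42} \]

when executed with random linear input measurements $\mathcal{M}x \in \mathbb{C}^m$. Here $m = O\left(\frac{s}{\epsilon} \cdot \ln\left(\frac{s/\epsilon}{1-\sigma}\right) \ln d\right)$
suffices. The required runtime of the algorithm is $O\left(\frac{s}{\epsilon} \cdot \ln\left(\frac{s/\epsilon}{1-\sigma}\right) \ln\left(\frac{d}{1-\sigma}\right)\right)$ in this case (footnote 15).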


Theorem 7 now follows easily from Theorems 3 and 8.

Abstract

This  paper  discusses  the  detection  of  edges  from  two-dimensional  truncated  Fourier  spectral 
data.  Compared  to  edge  detection  from  pixel  data,  this  is  a  more  challenging  problem  since  we 
seek  accurate  local  information  from  a  small  number  of  often  noisy  global  measurements.  We 
propose  a  highly  effective  algorithm  using  a  specific  class  of  spectral  modifiers  which  converges 
uniformly  to  sharp  peaks  along  the  singular  support  of  the  function.  We  provide  theoretical 
guarantees  and  numerical  simulations  to  show  that  the  resulting  edge  map  is  free  of  spurious 
edges  and  oscillations. 

1  Introduction 

The detection of jump discontinuities in piece-wise smooth functions is an important task in
several areas of science and engineering. For example, many image and video processing operations
such as segmentation and feature extraction rely on the accurate identification of edges in the
underlying image (see for example [1, Chapter 10] for a discussion). Similarly, high-order
methods for the numerical solution of PDEs often incorporate jump information when the solution is
piece-wise smooth [2, Chapter 9]. Although edge detection is a non-trivial problem (especially
when dealing with discrete and/or quantized data, and in the presence of noise), efficient and
accurate algorithms such as the (W)ENO schemes [3,4] and the Canny edge detector [5] exist
for identifying edge locations when we start with physical space or pixel data. Certain
applications, however, require that we extract edge information starting with spectral data. The most
common example is magnetic resonance imaging (MRI), where the underlying physics of nuclear
magnetic resonance implies that the MR scanner collects samples of the Fourier transform of
the specimen being imaged. Identifying edges from such data is a significantly more
challenging problem since we seek accurate local information from a small number of often noisy global
measurements.

We begin by illustrating this problem in one dimension. Consider the piece-wise smooth test
function $f : [0,1) \to \mathbb{R}$,

\[ f(x) = a(x)\sin(\pi x), \tag{1.1} \]

where $a(x)$ is piece-wise constant, taking the values $2$, $0$, $1$ and $-1$ on four consecutive subintervals of $[0,1)$ (the breakpoints of these subintervals are illegible in the source).

The jump discontinuities in $f$ are completely described by its associated jump function, $[f]$,
defined as

\[ [f](x) = \begin{cases} f(x^+) - f(x^-), & x \in (0,1), \\ f(0^+) - f(1^-), & x = 0. \end{cases} \tag{1.2} \]


Given the first $2N+1$ Fourier coefficients of $f$,

\[ \hat f(k) = \int_0^1 f(x) e^{-2\pi i k x}\,dx, \qquad k = -N, \ldots, N, \]

how do we identify the locations and values of its jump discontinuities, i.e., how do we
approximate $[f]$? The naive approach would be to compute the $2N+1$ mode Fourier partial sum
approximation of $f$ on an equispaced grid $\{x_j\}$,

\[ S_N f(x_j) = \sum_{|k| \le N} \hat f(k) e^{2\pi i k x_j}, \]

followed by the application of a local differencing scheme such as the (undivided) forward
difference operator

\[ D^+ S_N f(x_j) = \begin{cases} S_N f(x_{j+1}) - S_N f(x_j), & j \in [0, N-2], \\ S_N f(x_0) - S_N f(x_{N-1}), & j = N-1. \end{cases} \tag{1.3} \]


The results using such an approach are shown in Fig. 1a, where $f$, $S_N f$ and $D^+ S_N f$ are plotted
using dashed, solid (red) and solid (blue) lines respectively. A simple detector function of the
form

\[ \mathcal{E}(x_j) = \begin{cases} D^+ S_N f(x_j), & |D^+ S_N f(x_j)| > |D^+ S_N f(x_{j\pm1})|,\ D^+ S_N f(x_j) > \gamma, \\ 0, & \text{else}, \end{cases} \tag{1.4} \]

is used to extract jump information from $D^+ S_N f$, where $\gamma$ is a detection threshold. Since
$S_N f$ (and consequently, $D^+ S_N f$) is a Fourier approximation of a piece-wise smooth function,
it suffers from non-physical Gibbs oscillations. The largest of these (which are 9% of the
corresponding jump height) are observed to be of the same order as the smallest jump in Fig.
1a. Unsurprisingly, the detector function (1.4) mistakes these oscillations for legitimate edges.
Therefore, the challenge in detecting jump discontinuities from Fourier data is to distinguish
these non-physical Gibbs oscillations from legitimate edges, or to eliminate them entirely.
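
The failure mode described above can be reproduced with a minimal NumPy sketch. It does not use test function (1.1); as a stand-in assumption it takes the indicator of [1/4, 3/4), whose Fourier coefficients are known in closed form, with N = 32 modes and a 256-point grid. The function names, the test function and the grid size are all illustrative choices, not taken from the paper.

```python
import numpy as np

def step_coefficients(N, a=0.25, b=0.75):
    """Fourier coefficients of the indicator of [a, b) for k = -N..N."""
    k = np.arange(-N, N + 1)
    fhat = np.full(k.shape, b - a, dtype=complex)   # k = 0 entry is the mean
    nz = k != 0
    fhat[nz] = (np.exp(-2j * np.pi * k[nz] * a)
                - np.exp(-2j * np.pi * k[nz] * b)) / (2j * np.pi * k[nz])
    return k, fhat

def partial_sum(k, fhat, x):
    """Evaluate S_N f(x_j) = sum_k fhat(k) exp(2 pi i k x_j)."""
    return (np.exp(2j * np.pi * np.outer(x, k)) @ fhat).real

N, M = 32, 256                      # assumed mode count and grid size
x = np.arange(M) / M
k, fhat = step_coefficients(N)
S = partial_sum(k, fhat, x)
D = np.roll(S, -1) - S              # undivided forward difference, periodic wrap

overshoot = S.max() - 1.0           # Gibbs overshoot above the true maximum of 1
j_peak = int(np.argmax(np.abs(D)))  # strongest response of the differences
# Responses well away from both jumps are pure Gibbs ripples.
away = (np.abs(x - 0.25) > 0.05) & (np.abs(x - 0.75) > 0.05)
ripple = np.abs(D[away]).max()
```

In this sketch `overshoot` measures the Gibbs overshoot of the partial sum and `ripple` the size of the spurious difference responses away from the jumps; a threshold-based detector such as (1.4) has to separate the latter from genuine small jumps.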

[Figure 1: Jump Detection from one-dimensional truncated Fourier data. The jumps in test function (1.1) are detected using $2N+1$ Fourier modes, with N = 32 and a reconstruction grid of 250 points. Panels: (a) (Undivided) Finite Differences, titled "Finite Difference Jump Approximation"; (b) Spectral Modifiers, titled "Gibbs-Free Jump Approximation".]

The latter approach was pursued by Cochran et al. in [6], where the detection of jump
discontinuities from one-dimensional truncated Fourier data using a special class of spectral
mollifiers was discussed. They proposed to approximate $[f]$ by a sequence of smooth pulses,
$Q_N^\sigma(x) = \sum_{\xi \in \mathcal{K}} [f](\xi)\,\sigma_N(x - \xi)$, where $\mathcal{K}$ is the set of jump locations of $f$. For increasing $N$, $Q_N^\sigma$
is increasingly concentrated at the jumps and $\sigma$ is drawn from an appropriate class of functions
so as to ensure $Q_N^\sigma$ has no oscillations. Further, it was shown that a mollified Fourier derivative
operator of the form

\[ T_N[\sigma_{\lambda_N}](x) = \sum_{|k| \le N} k\,\hat\sigma_{\lambda_N}(k)\,\hat f(k)\,e^{2\pi i k x} \tag{1.5} \]

converges uniformly to $Q_N^\sigma$ for suitable choice of $\sigma$ and sequence $\lambda_N$. A representative result
of this method is shown in Fig. 1b, confirming the oscillation-free approximation qualities
of $T_N[\sigma_{\lambda_N}]$. The edge detector function (1.4) applied to $T_N[\sigma_{\lambda_N}]$ now contains no spurious
responses as was the case in Fig. 1a. We note that the jump approximation (1.5) is a specialization
of the more general class of concentration edge detectors first introduced by Gelb and Tadmor
in [7,8] and refined in [9-11]. These methods generally begin with a jump approximation of the
form


where  to  defines  a  concentration  factor.  The  corresponding  physical-space  concentration  kernels 
are  typically  odd,  suitably  scaled,  smooth  and  oscillatory.  The  oscillatory  nature  of  these  kernels 
makes  it  difficult  to  implement  reliable  edge  detector  functions,  especially  in  the  presence  of 
noise. 

Needless to say, the same issues exist in two dimensions, as illustrated in Fig. 2, where the
edges of a Shepp-Logan brain phantom are identified using the Canny edge detector. A Fourier
partial sum reconstruction on a 256 x 256 grid and using 50 x 50 Fourier modes serves as the
input to the Canny edge detector. Fig. 2b plots the generated edge map while Fig. 2c shows
a cross-section at the center of the image. The identified edges and the Fourier reconstruction
along this cross-section are plotted using dashed and solid lines respectively. The comments and
observations regarding the Gibbs phenomena in Fig. 1 apply here too. Our objective in this
paper is to extend the one-dimensional framework introduced in [6] to the detection of edges
from two-dimensional truncated and noisy Fourier data.

[Figure 2: The Canny edge detector applied to a 50 x 50 mode partial sum Fourier reconstruction of the Shepp-Logan phantom on a 256 x 256 grid. Panels: (a) Shepp-Logan Phantom; (b) Edge Map (Canny Edge Map); (c) Cross-Sectional View.]

It is appropriate at this point to mention other related approaches to this problem and
their relative advantages and disadvantages. We start with popular pixel-space edge detectors
such as the Sobel, Prewitt or Marr-Hildreth edge detectors (see [1, Chapter 10] for a review)
as well as more specialized algorithms such as the Canny edge detector [5]. As mentioned
previously, these pixel-space approaches suffer from the tendency to mistake Gibbs oscillations
for edges when applied to Fourier data. The method proposed here is more closely related to the
two-dimensional concentration kernel approaches discussed in [12] and [13]. [12] uses statistical
hypothesis testing methods to distinguish true edges from Gibbs oscillations, while [13] uses
regularized bump functions and rotation-based post-processing operations to identify edges.
The main contribution of this paper is the use of a specific form of spectral mollifier (and
associated parameters) as well as a rigorous analysis of the same, demonstrating the
oscillation-free nature of the resulting edge approximation. We note that this framework can be combined
with any other post-processing procedures, including Canny-type hysteresis edge tracking.

The rest of this paper is organized as follows: §2 introduces our two-dimensional edge
detection scheme. A rigorous analysis examining the convergence of the scheme and confirming
the absence of oscillations in the approximation is presented in §3. §4 provides numerical results,
including comparisons to pixel data methods such as the Canny edge detector and existing
Fourier data schemes such as the concentration method. Performance in the presence of noise
is also examined. Some concluding remarks and future directions are presented in §5.

2  Two Dimensional Edge Detection using Gaussian Mollifiers

We first give a brief introduction to the problem of detecting edges from 2-D Fourier data.
Suppose $f : \mathbb{R}^2 \to \mathbb{R}$ is piece-wise smooth and compactly supported on $[0,1]^2$. We are given
its finite Fourier data: $\hat f(z)$ for $z = (z_1, z_2) \in S_N := [-N, N]^2 \cap \mathbb{Z}^2$, where

\[ \hat f(z) = \int_{(x,y) \in \mathbb{R}^2} f(x,y)\, e^{-2\pi i z_1 x} e^{-2\pi i z_2 y}\,dx\,dy. \]

We would like to identify all of its discontinuities in $[0,1]^2$ and the corresponding jump heights
that will be defined below.

We next present some assumptions on the function $f$ and define the jump heights at the
discontinuities. We will assume the set $\Gamma$ of all the discontinuities consists of a few finite and
disjoint smooth curves. In particular, we could write all the discontinuities in the following two
ways:

\[ (\alpha_j(y), y), \qquad j = 1, 2, \ldots, N_y, \qquad y \in \mathbb{R}, \]

and

\[ (x, a_j(x)), \qquad j = 1, 2, \ldots, M_x, \qquad x \in \mathbb{R}, \]

where $N_y$ is a finite number for all but finitely many $y$'s and $M_x$ is a finite number for all but
finitely many $x$'s. We would also assume $M_x$ and $N_y$ are uniformly bounded for all $x \in \mathbb{R}$
and all $y \in \mathbb{R}$. A simple illustration is shown in Figure 3. Since we assume the discontinuities
are smooth curves, both $\alpha_j(y)$ and $a_j(x)$ are smooth functions locally by the Implicit Function
Theorem for almost all $y$'s and for almost all $x$'s respectively. Let $f_x$ and $f_y$ denote the partial
derivatives of $f$ at points other than the discontinuities. Note that both $f_x$ and $f_y$ are again
piece-wise smooth with the same discontinuities as $f$. Let

\[ [f]_1(x,y) = f(x^+, y) - f(x^-, y), \quad \text{and} \quad [f]_2(x,y) = f(x, y^+) - f(x, y^-), \qquad (x,y) \in \mathbb{R}^2. \]

We point out that when they are different, one of them must be zero. In particular, $[f]_1(\alpha_j(y), y) =
[f]_2(\alpha_j(y), y)$ for all but $y$'s with $\alpha_j'(y) = 0$ or $\infty$. Consequently, we define the jump height
$[f](x,y)$ to be either one of them when they are the same, and the nonzero one if they are
different.

[Figure 3: Edge Detection in two dimensions, Principle.]

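We next introduce our edge detector by using the spectral Gaussian modifiers. For $\lambda > 0$,
we define

\[ I_{N,\lambda}(x,y) = -2\pi i \sum_{z \in S_N} \hat f(z)\, z_1\, e^{-\frac{\|z\|^2}{\lambda^2}} e^{2\pi i (z_1 x + z_2 y)} \tag{2.1} \]

and

\[ J_{N,\lambda}(x,y) = -2\pi i \sum_{z \in S_N} \hat f(z)\, z_2\, e^{-\frac{\|z\|^2}{\lambda^2}} e^{2\pi i (z_1 x + z_2 y)}. \]

We will use the following function to detect the edges of $f$:

\[ E_{N,\lambda}(x,y) = \frac{1}{\sqrt{\pi}\lambda} \left[ I_{N,\lambda}^2(x,y) + J_{N,\lambda}^2(x,y) \right]^{1/2} \mathrm{sgn}(I_{N,\lambda}(x,y)), \qquad (x,y) \in \mathbb{R}^2, \tag{2.2} \]

where $\mathrm{sgn}(t) = 1$ if $t > 0$ and $\mathrm{sgn}(t) = -1$ otherwise.

We remark that without the Gaussian mollifier (i.e., $\lambda = \infty$), $I_{N,\lambda}(x,y)$ and $J_{N,\lambda}(x,y)$ would
reduce to the partial derivatives of $f(x)$, which would yield spikes at the edges in addition to
non-physical Gibbs oscillations in their vicinity. We will show in the next section that with suitably
chosen $\lambda$, the function $E_{N,\lambda}(x,y)$ in (2.2) is a robust and accurate edge detector.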
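
A minimal NumPy sketch of the detector may help fix ideas. Several things here are assumptions rather than the authors' code: the Gaussian factor is taken as exp(-||z||^2 / lambda^2), as it appears in the proof of Proposition 3.1; the output is divided by sqrt(pi) * lambda, the normalization used in Theorem 3.4; and the sums use the prefactor +2*pi*i so that they equal the derivatives of the mollified function under the transform convention of this section (the printed formulas carry a leading minus sign, which would only flip the sign of the output). The test image, the indicator of the square [0.3, 0.7]^2, and all function names are hypothetical.

```python
import numpy as np

def box_coeffs_1d(k, a, b):
    """Fourier coefficients of the indicator of [a, b): int_a^b exp(-2 pi i k x) dx."""
    k = np.asarray(k, dtype=float)
    out = np.full(k.shape, b - a, dtype=complex)
    nz = k != 0
    out[nz] = (np.exp(-2j * np.pi * k[nz] * a)
               - np.exp(-2j * np.pi * k[nz] * b)) / (2j * np.pi * k[nz])
    return out

def edge_detector(fhat, N, lam, x, y):
    """Evaluate the normalized detector on the points x (rows) by y (columns).

    fhat[i, j] holds the coefficient for z = (i - N, j - N).
    """
    k = np.arange(-N, N + 1)
    gauss = np.exp(-(k[:, None] ** 2 + k[None, :] ** 2) / lam ** 2)
    Ex = np.exp(2j * np.pi * np.outer(x, k))     # shape (len(x), 2N+1)
    Ey = np.exp(2j * np.pi * np.outer(y, k))     # shape (len(y), 2N+1)
    base = fhat * gauss
    # Assumed sign convention: +2*pi*i (see the lead-in above).
    I = (2j * np.pi * Ex @ (k[:, None] * base) @ Ey.T).real
    J = (2j * np.pi * Ex @ (k[None, :] * base) @ Ey.T).real
    sign = np.where(I > 0, 1.0, -1.0)            # sgn(t) = 1 if t > 0, else -1
    return np.sqrt(I ** 2 + J ** 2) * sign / (np.sqrt(np.pi) * lam)

N, lam = 32, 8.0                                 # illustrative parameter choices
k = np.arange(-N, N + 1)
c = box_coeffs_1d(k, 0.3, 0.7)
fhat = np.outer(c, c)                            # separable coefficients of the square
x = np.array([0.3, 0.5, 0.7])                    # left edge, interior, right edge
y = np.array([0.5])
E = edge_detector(fhat, N, lam, x, y)
```

With these choices the detector should return a value close to the jump height +1 on the left edge, close to 0 in the interior, and close to -1 on the right edge, since the jump of the indicator in the x-direction changes sign between the two edges.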

3  Convergence  Analysis 

We will present in this section the convergence analysis of the edge detector $E_{N,\lambda}(x,y)$ in (2.2).
Specifically, we will present how to choose the parameter $\lambda$ such that

(1) When $(x,y)$ is away from the edge curves of $f$, the value of $E_{N,\lambda}(x,y)$ is close to zero.

(2) When $(x,y)$ is on the edge curves of $f$, the value of $E_{N,\lambda}(x,y)$ is an approximation of the
jump height $[f](x,y)$.

(3) The function $E_{N,\lambda}(x,y)$ behaves like sharp “mountains” rather than oscillating peaks
around the edges. That is, the Gibbs oscillation is controlled.

(4) The edge detector $E_{N,\lambda}(x,y)$ is robust with respect to small perturbations/noise on the
spectral data $\hat f(z)$.

We  point  out  that  the  convolution  of  /  and  a  Gaussian  function  is  an  important  resource 
for  locating  the  edges  of  /,  since  the  partial  derivatives  of  the  convolution  would  show  some 
singular  behaviors  (sharp  “mountains”)  around  the  edges.  To  this  end,  we  would  define  this 
convolution  and  study  the  relation  between  its  partial  derivatives  and  our  edge  detector.  We 
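consider the following Gaussian function

\[ \phi(x,y) = \pi e^{-\pi^2 (x^2 + y^2)}, \qquad (x,y) \in \mathbb{R}^2, \]

and for $\lambda \in \mathbb{R}$, let

\[ \phi_\lambda(x,y) = \lambda^2 \phi(\lambda x, \lambda y), \qquad (x,y) \in \mathbb{R}^2. \tag{3.1} \]

We then convolve $f$ with $\phi_\lambda$:

\[ F_\lambda(x,y) = (f * \phi_\lambda)(x,y) = \int_{(s,t) \in \mathbb{R}^2} f(s,t)\, \phi_\lambda(x - s, y - t)\,ds\,dt, \qquad (x,y) \in \mathbb{R}^2. \tag{3.2} \]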
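
As a worked check (not part of the original text), the Fourier transform of the mollifier can be computed directly, using the same transform convention as the definition of the Fourier data in §2. It shows why convolving with the Gaussian above corresponds to multiplying the Fourier data by a Gaussian factor in $z$, and that the mollifier has unit mass. The only outside fact used is the standard one-dimensional Gaussian integral.

```latex
% One-dimensional Gaussian transform, with a = \pi^2 \lambda^2:
\int_{\mathbb{R}} e^{-\pi^2\lambda^2 x^2} e^{-2\pi i \xi x}\,dx
  = \sqrt{\frac{\pi}{\pi^2\lambda^2}}\; e^{-\pi^2\xi^2/(\pi^2\lambda^2)}
  = \frac{1}{\sqrt{\pi}\,\lambda}\, e^{-\xi^2/\lambda^2}.

% The mollifier \phi_\lambda(x,y) = \pi\lambda^2 e^{-\pi^2\lambda^2(x^2+y^2)} is a product of two such factors:
\hat\phi_\lambda(\xi_1,\xi_2)
  = \pi\lambda^2 \cdot \frac{1}{\sqrt{\pi}\,\lambda}\, e^{-\xi_1^2/\lambda^2}
    \cdot \frac{1}{\sqrt{\pi}\,\lambda}\, e^{-\xi_2^2/\lambda^2}
  = e^{-\|\xi\|^2/\lambda^2}.

% Setting \xi = 0 gives the unit-mass property used later in the proof of Proposition 3.2:
\int_{\mathbb{R}^2} \phi_\lambda(u,v)\,du\,dv = \hat\phi_\lambda(0,0) = 1.
```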

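We will next focus on deriving estimates of the edge detector $I_{N,\lambda}$. The estimates of $J_{N,\lambda}$
could be obtained in a similar way. We shall first show a relation between $I_{N,\lambda}$ and the partial
derivative $\frac{\partial F_\lambda}{\partial x}$. To this end, for $(x,y) \in \mathbb{R}^2$ we let

\[ Q_\lambda(x,y) = \sum_{z \in \mathbb{Z}^2} \frac{\partial F_\lambda}{\partial x}(x + z_1, y + z_2) \tag{3.3} \]

and

\[ B_N(x,y) = 2\pi i \sum_{z \in \mathbb{Z}^2 \setminus S_N} \hat f(z)\, z_1\, e^{-\frac{\|z\|^2}{\lambda^2}} e^{2\pi i (z_1 x + z_2 y)}. \tag{3.4} \]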

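Proposition 3.1 For $(x,y) \in \mathbb{R}^2$, there holds

\[ I_{N,\lambda}(x,y) = Q_\lambda(x,y) + B_N(x,y). \]

Moreover, there exists a positive constant $c$ such that for any $(x,y) \in \mathbb{R}^2$ and $N \in \mathbb{N}$

\[ |I_{N,\lambda}(x,y) - Q_\lambda(x,y)| \le c \lambda^3 e^{-\frac{3N^2}{2\lambda^2}}. \]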


Proof: We will prove the equality by a direct computation. To this end, we define the shift of
the partial derivatives

\[ g_{(x,y)}(s,t) = \frac{\partial F_\lambda}{\partial x}(x + s, y + t), \qquad (s,t) \in \mathbb{R}^2. \]

It follows that

\[ Q_\lambda(x,y) = \sum_{z \in \mathbb{Z}^2} g_{(x,y)}(z). \]

On the other hand, a direct computation yields that $\hat g_{(x,y)}(\xi) = -2\pi i \hat f(\xi)\, \xi_1\, e^{-\frac{\|\xi\|^2}{\lambda^2}} e^{2\pi i (\xi_1 x + \xi_2 y)}$.
By the Poisson summation formula,

\[ Q_\lambda(x,y) = \sum_{z \in \mathbb{Z}^2} \hat g_{(x,y)}(z) = \sum_{z \in \mathbb{Z}^2} -2\pi i \hat f(z)\, z_1\, e^{-\frac{\|z\|^2}{\lambda^2}} e^{2\pi i (z_1 x + z_2 y)}. \]

This combined with the definition of $I_{N,\lambda}$ in (2.1) and the definition of $B_N$ in (3.4) implies the
desired equality.
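
The Poisson summation step can be sanity-checked numerically in one dimension. The sketch below is an illustration rather than part of the proof: it compares the periodized Gaussian with its Fourier series, using the one-dimensional transform pair g(x) = lambda * sqrt(pi) * exp(-pi^2 lambda^2 x^2) and its transform exp(-xi^2 / lambda^2). The truncation limits and the test values of x and lambda are arbitrary assumptions.

```python
import numpy as np

def periodized_gaussian(x, lam, n_max=50):
    """Physical-space side: sum over integer shifts n of g(x + n)."""
    n = np.arange(-n_max, n_max + 1)
    return np.sum(lam * np.sqrt(np.pi) * np.exp(-(np.pi * lam * (x + n)) ** 2))

def fourier_series(x, lam, k_max=200):
    """Frequency side: sum over integers k of exp(-k^2 / lam^2) exp(2 pi i k x)."""
    k = np.arange(-k_max, k_max + 1)
    return np.sum(np.exp(-(k / lam) ** 2) * np.exp(2j * np.pi * k * x)).real

lam, x0 = 2.0, 0.3
lhs = periodized_gaussian(x0, lam)
rhs = fourier_series(x0, lam)
```

The two sides agree to rounding error, which is the identity the proof applies (in two dimensions, and to the derivative of the mollified function) to pass from the sum of shifted derivatives to a sum over all integer frequencies.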

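We next show the inequality. It is enough to show $B_N(x,y)$ is bounded by the right hand
side of the inequality. Since $f \in L^2[0,1]$, there exists a positive constant $c_0$ such that $|\hat f(z)| \le c_0$.
It follows from (3.4) that

\[ |B_N(x,y)| \le 2\pi c_0 \sum_{z \in \mathbb{Z}^2 \setminus S_N} z_1 e^{-\frac{\|z\|^2}{\lambda^2}} = 2\pi c_0 \sum_{z_1 > N} \sum_{z_2 > N} z_1 e^{-\frac{z_1^2}{\lambda^2}} e^{-\frac{z_2^2}{\lambda^2}}. \]

It is direct to observe that $\sum_{z_1 > N} z_1 e^{-\frac{z_1^2}{\lambda^2}} \le \int_N^\infty t e^{-\frac{t^2}{\lambda^2}}\,dt = \frac{\lambda^2}{2} e^{-\frac{N^2}{\lambda^2}}$. Moreover, by using
the polar coordinates, it follows from a direct computation that $\sum_{z_2 > N} e^{-\frac{z_2^2}{\lambda^2}} \le \frac{\sqrt{\pi}}{2} \lambda e^{-\frac{N^2}{2\lambda^2}}$.
Substituting these two estimates into the above inequality, we have

\[ |B_N(x,y)| \le \frac{\pi^{3/2}}{2} c_0 \lambda^3 e^{-\frac{3N^2}{2\lambda^2}}, \]

which implies the desired inequality. □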

We point out that we could choose appropriate $\lambda$ depending on $N$ such that $B_N(x,y)$
converges to zero uniformly, which avoids the Gibbs oscillation in the edge detectors. More details
will be shown in later results.
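
To see how quickly the truncation term can be driven to zero, the following sketch evaluates the quantity that controls it, the sum of |z_1| exp(-||z||^2 / lambda^2) over integer points outside the square of retained modes, for a few values of lambda at fixed N. The function name, the padding used to truncate the infinite sum, and the particular values of N and lambda are illustrative assumptions.

```python
import numpy as np

def tail_sum(N, lam, pad=200):
    """Sum of |z1| exp(-|z|^2 / lam^2) over integer z outside [-N, N]^2.

    The infinite sum is truncated to the box [-(N + pad), N + pad]^2.
    The region is split as {|z1| > N} plus {|z1| <= N, |z2| > N}, which
    avoids subtracting two nearly equal numbers.
    """
    k = np.arange(-(N + pad), N + pad + 1)
    w = np.exp(-(k / lam) ** 2)                 # one-dimensional Gaussian weights
    out = np.abs(k) > N
    a_out = np.sum(np.abs(k[out]) * w[out])     # |z1| > N
    a_in = np.sum(np.abs(k[~out]) * w[~out])    # |z1| <= N
    b_all = np.sum(w)                           # all z2
    b_out = np.sum(w[out])                      # |z2| > N
    return a_out * b_all + a_in * b_out

N = 32
tails = {lam: tail_sum(N, lam) for lam in (4.0, 8.0, 16.0, 32.0)}
```

For fixed N the tail shrinks very rapidly as lambda decreases relative to N, which is the trade-off behind choosing lambda as a function of N: a smaller lambda suppresses the discarded modes but widens the physical-space mollifier.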

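We shall next analyze the behavior of $Q_\lambda$. In particular, we will show that $Q_\lambda$ has peaks at
the edges by using its relation with the partial derivative $\frac{\partial F_\lambda}{\partial x}$. To this end, we first present a
direct computation of $\frac{\partial F_\lambda}{\partial x}$. We let

\[ I_\lambda(x,y) = \int_{t \in \mathbb{R}} \sum_{j=1}^{N_t} [f]_1(\alpha_j(t), t)\, \phi_\lambda(x - \alpha_j(t), y - t)\,dt, \tag{3.5} \]

and

\[ H_\lambda(x,y) = \int_{(s,t) \in \mathbb{R}^2} f_x(s,t)\, \phi_\lambda(x - s, y - t)\,ds\,dt. \tag{3.6} \]

We have the following result on the partial derivative $\frac{\partial F_\lambda}{\partial x}$.

Proposition 3.2 For any $(x,y) \in \mathbb{R}^2$, there holds that

\[ \frac{\partial F_\lambda}{\partial x}(x,y) = I_\lambda(x,y) + H_\lambda(x,y). \]

Moreover, if the edge curves are at least $\epsilon$ away from the boundary of $[0,1]^2$, that is, $\sqrt{(x - x^*)^2 + (y - y^*)^2} \ge \epsilon$
for all $(x,y) \in [0,1]^2$ with either $x \in \{0,1\}$ or $y \in \{0,1\}$ and for all $(x^*, y^*)$ on the edge curves,
then there exists a positive constant $c$ such that for $(x,y) \in [0,1]^2$

\[ |Q_\lambda(x,y) - I_\lambda(x,y)| \le c\pi\lambda^2 e^{-\pi^2\lambda^2\epsilon^2} + c\pi\lambda^2 \frac{e^{-\pi^2\lambda^2}}{(1 - e^{-2\lambda^2})^2} + \|f_x\|_\infty. \]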

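Proof: We first show the equality about the decomposition of $\frac{\partial F_\lambda}{\partial x}$. From the definition of $F_\lambda$
in (3.2), we have

\[ \frac{\partial F_\lambda}{\partial x}(x,y) = \int_{(s,t) \in \mathbb{R}^2} f(s,t)\, \frac{\partial \phi_\lambda}{\partial x}(x - s, y - t)\,ds\,dt. \]

A direct calculation of $\frac{\partial \phi_\lambda}{\partial x}$ from (3.1) yields that

\[ \frac{\partial F_\lambda}{\partial x}(x,y) = \lambda^3 \int_{t \in \mathbb{R}} \left[ \int_{s \in \mathbb{R}} f(s,t)\, \phi_x(\lambda(x - s), \lambda(y - t))\,ds \right] dt. \]

Note that for $t \in \mathbb{R}$, $f(\cdot, t)$ has discontinuities $\alpha_j(t)$ for $1 \le j \le N_t$. For simplicity of
presentation, we let $\alpha_0(t) = -\infty$ and $\alpha_{N_t+1}(t) = \infty$. It follows that

\[ \frac{\partial F_\lambda}{\partial x}(x,y) = \lambda^3 \int_{t \in \mathbb{R}} \left[ \sum_{j=1}^{N_t+1} \int_{\alpha_{j-1}(t)}^{\alpha_j(t)} f(s,t)\, \phi_x(\lambda(x - s), \lambda(y - t))\,ds \right] dt. \]

Apply integration by parts and we have

\[ \frac{\partial F_\lambda}{\partial x}(x,y) = \lambda^2 \int_{t \in \mathbb{R}} \left[ -\sum_{j=1}^{N_t+1} f(s,t)\, \phi(\lambda(x - s), \lambda(y - t)) \Big|_{s = \alpha_{j-1}(t)}^{\alpha_j(t)} + \int_{\mathbb{R}} f_x(s,t)\, \phi(\lambda(x - s), \lambda(y - t))\,ds \right] dt. \]

The desired equality follows from a direct calculation from the above equality.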

We next estimate the difference of $Q_\lambda(x,y)$ and $I_\lambda(x,y)$. It follows from the definition of $Q_\lambda$
in (3.3) and the equality shown above that

\[ |Q_\lambda(x,y) - I_\lambda(x,y)| \le \sum_{z \ne 0} |I_\lambda(x + z_1, y + z_2)| + \sum_{z \in \mathbb{Z}^2} |H_\lambda(x + z_1, y + z_2)|. \tag{3.7} \]

We will estimate the two terms in the right hand side of the above inequality separately.

We start with an estimate of the first term. Note that both $N_t$ and $[f]_1(\alpha_j(t), t)$ are uniformly
bounded for all $t$. It follows from the definition of $I_\lambda$ in (3.5) that there exists a positive constant
$c_0$ such that

\[ \sum_{z \ne 0} |I_\lambda(x + z_1, y + z_2)| \le c_0 \sum_{z \ne 0} \int_{\mathbb{R}} \phi_\lambda(x + z_1 - \alpha_j(t), y + z_2 - t)\,dt. \]

Note that when $z \ne 0$, the point $(x + z_1, y + z_2)$ is not in $[0,1]^2$. By assumption, it is at
least $\epsilon$ away from the edge curves. In particular, when $z \in E := \{-1, 0, 1\}^2 \setminus \{(0,0)\}$, we have
$(x + z_1 - \alpha_j(t))^2 + (y + z_2 - t)^2 \ge \epsilon^2$. On the other hand, when $|z_1| \ge 2$, the point
$(x + z_1, y + z_2)$ is at least $|z_1| - 1$ away from the edge curves. When $|z_2| \ge 2$, the point
$(x + z_1, y + z_2)$ is at least $|z_2| - 1$ away from the edge curves. Substituting these estimates into
$\phi_\lambda$ as in (3.1) yields that

\[ \sum_{z \ne 0} |I_\lambda(x + z_1, y + z_2)| \le \sum_{z \in E} |I_\lambda(x + z_1, y + z_2)| + \sum_{z_1 \in \mathbb{Z},\, |z_2| \ge 2} |I_\lambda(x + z_1, y + z_2)| + \sum_{|z_1| \ge 2,\, z_2 \in \mathbb{Z}} |I_\lambda(x + z_1, y + z_2)| \]
\[ \le 5 c_0 \pi \lambda^2 e^{-\pi^2\lambda^2\epsilon^2} + 4 c_0 \pi \lambda^2 \frac{e^{-\pi^2\lambda^2}}{(1 - e^{-2\lambda^2})^2}, \]

which combined with (3.7) implies

\[ |Q_\lambda(x,y) - I_\lambda(x,y)| \le 5 c_0 \pi \lambda^2 e^{-\pi^2\lambda^2\epsilon^2} + 4 c_0 \pi \lambda^2 \frac{e^{-\pi^2\lambda^2}}{(1 - e^{-2\lambda^2})^2} + \sum_{z \in \mathbb{Z}^2} |H_\lambda(x + z_1, y + z_2)|. \]


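To show the desired result on $|Q_\lambda(x,y) - I_\lambda(x,y)|$, it remains to prove $\sum_{z \in \mathbb{Z}^2} |H_\lambda(x + z_1, y + z_2)| \le \|f_x\|_\infty$.
Note that $f_x(s,t) = 0$ when $(s,t) \notin [0,1]^2$. It is direct to observe from the
definition of $H_\lambda$ in (3.6) that for any $z \in \mathbb{Z}^2$,

\[ |H_\lambda(x + z_1, y + z_2)| = \left| \int_{(s,t) \in [0,1]^2} f_x(s,t)\, \phi_\lambda(x + z_1 - s, y + z_2 - t)\,ds\,dt \right| \le \|f_x\|_\infty \int_{(s,t) \in [0,1]^2} \phi_\lambda(x + z_1 - s, y + z_2 - t)\,ds\,dt. \]

It implies

\[ \sum_{z \in \mathbb{Z}^2} |H_\lambda(x + z_1, y + z_2)| \le \|f_x\|_\infty \sum_{z \in \mathbb{Z}^2} \int_{(s,t) \in [0,1]^2} \phi_\lambda(x + z_1 - s, y + z_2 - t)\,ds\,dt = \|f_x\|_\infty \int_{(u,v) \in \mathbb{R}^2} \phi_\lambda(u,v)\,du\,dv = \|f_x\|_\infty, \]

which finishes the proof. □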

We will continue with the analysis of $I_\lambda$. In particular, we will show that it is concentrated
around the edges of $f$.

Proposition 3.3 (i) When $(x,y)$ is at least $\epsilon$ away from the edges, that is, $\mathrm{dist}((x,y), \Gamma) \ge \epsilon$,
there exists a positive constant $c$ such that

\[ |I_\lambda(x,y)| \le c \lambda^2 e^{-\pi^2\lambda^2\epsilon^2}. \]

(ii) When $(x^*, y^*)$ is on the edge, that is, $x^* = \alpha_{j_0}(y^*)$ for some $j_0 \in \mathbb{N}$, if there exists an $\epsilon > 0$
such that $d((\alpha_j(y), y), (\alpha_{j_0}(y^*), y^*)) = \sqrt{(\alpha_j(y) - \alpha_{j_0}(y^*))^2 + (y - y^*)^2} \ge \epsilon$ for all $j \ne j_0$

DISTRIBUTION  A:  Distribution  approved  for  public  release. 


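and $y \in [0,1]$, then there exists a positive constant $c$ such that

\[ \left| I_\lambda(x^*, y^*) - \frac{\sqrt{\pi}\lambda\, [f]_1(x^*, y^*)}{\sqrt{1 + (\alpha_{j_0}'(y^*))^2}} \right| \le c\left( \lambda e^{-\pi^2\lambda^2\epsilon^2} + \lambda^2 e^{-\pi^2\lambda^2\epsilon^2} + (\lambda\epsilon)^2 + (\lambda\epsilon)^4 \right). \]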


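Proof: (i) Note that both $N_t$ and the jump heights $[f]_1(\alpha_j(t), t)$ are uniformly bounded. Since
$f$ is compactly supported on $[0,1]^2$, it follows from the definition of $I_\lambda$ in (3.5) that there exists
a positive constant $c_0$ such that

\[ |I_\lambda(x,y)| \le c_0 \int_0^1 \phi_\lambda(x - \alpha_j(t), y - t)\,dt. \]

By the definition of $\phi_\lambda$ in (3.1),

\[ |I_\lambda(x,y)| \le c_0 \int_0^1 \pi\lambda^2 e^{-\pi^2\lambda^2 ((x - \alpha_j(t))^2 + (y - t)^2)}\,dt. \]

When $\mathrm{dist}((x,y), \Gamma) \ge \epsilon$, that is, $(x - \alpha_j(t))^2 + (y - t)^2 \ge \epsilon^2$ for all $t \in [0,1]$,

\[ |I_\lambda(x,y)| \le c_0 \pi \lambda^2 e^{-\pi^2\lambda^2\epsilon^2}. \]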


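(ii) By (3.5) we have that

\[ I_\lambda(x^*, y^*) = \int_{\mathbb{R}} \sum_{j=1}^{N_t} [f]_1(\alpha_j(t), t)\, \phi_\lambda(\alpha_{j_0}(y^*) - \alpha_j(t), y^* - t)\,dt. \tag{3.8} \]

It follows from the same argument in (i) that

\[ \left| \int_{|t - y^*| \ge \epsilon} \sum_{j=1}^{N_t} [f]_1(\alpha_j(t), t)\, \phi_\lambda(\alpha_{j_0}(y^*) - \alpha_j(t), y^* - t)\,dt \right| \le c_0 \pi \lambda^2 e^{-\pi^2\lambda^2\epsilon^2} \]

and

\[ \left| \int_{\mathbb{R}} \sum_{j \ne j_0} [f]_1(\alpha_j(t), t)\, \phi_\lambda(\alpha_{j_0}(y^*) - \alpha_j(t), y^* - t)\,dt \right| \le c_0 \pi \lambda^2 e^{-\pi^2\lambda^2\epsilon^2}. \]

These two inequalities combined with (3.8) yield that

\[ |I_\lambda(x^*, y^*) - I_{\lambda,\epsilon}(x^*, y^*)| \le 2 c_0 \pi \lambda^2 e^{-\pi^2\lambda^2\epsilon^2}, \tag{3.9} \]

where

\[ I_{\lambda,\epsilon}(x^*, y^*) = \int_{|t - y^*| < \epsilon} [f]_1(\alpha_{j_0}(t), t)\, \phi_\lambda(\alpha_{j_0}(y^*) - \alpha_{j_0}(t), y^* - t)\,dt. \tag{3.10} \]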


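We next estimate the integral $I_{\lambda,\epsilon}(x^*, y^*)$. Since both $[f]_1$ and $\alpha_{j_0}$ are smooth functions
locally, there exist positive constants $c_1$ and $c_2$ such that for $|t - y^*| < \epsilon$

\[ \left| [f]_1(\alpha_{j_0}(t), t) - [f]_1(\alpha_{j_0}(y^*), y^*) \right| \le c_1 |t - y^*| \tag{3.11} \]

and

\[ \left| \alpha_{j_0}(t) - \alpha_{j_0}(y^*) - \alpha_{j_0}'(y^*)(t - y^*) \right| \le c_2 |t - y^*|^2. \tag{3.12} \]

By the definition of $\phi_\lambda$ in (3.1),

\[ \phi_\lambda(\alpha_{j_0}(y^*) - \alpha_{j_0}(t), y^* - t) = \pi\lambda^2 e^{-\pi^2\lambda^2 [(\alpha_{j_0}(y^*) - \alpha_{j_0}(t))^2 + (t - y^*)^2]}. \]

Since $\alpha_{j_0}$ is smooth, there exists a positive constant $c_3$ such that $|\alpha_{j_0}(t) - \alpha_{j_0}(y^*) + \alpha_{j_0}'(y^*)(t - y^*)| \le c_3 |t - y^*|$.
This combined with (3.12) yields that $|(\alpha_{j_0}(t) - \alpha_{j_0}(y^*))^2 - (\alpha_{j_0}'(y^*))^2 (t - y^*)^2| \le c_2 c_3 |t - y^*|^3$.
Substituting it into the above equality and putting

\[ \psi_\lambda(t) = \pi\lambda^2 e^{-\pi^2\lambda^2 [1 + (\alpha_{j_0}'(y^*))^2] (t - y^*)^2}, \tag{3.13} \]

we have

\[ \left| \phi_\lambda(\alpha_{j_0}(y^*) - \alpha_{j_0}(t), y^* - t) - \psi_\lambda(t) \right| \le \psi_\lambda(t) \left( 1 - e^{-\pi^2\lambda^2 c_2 c_3 |t - y^*|^3} \right). \]

Since $1 - e^{-x} \le x$ for $x \ge 0$,

\[ \left| \phi_\lambda(\alpha_{j_0}(y^*) - \alpha_{j_0}(t), y^* - t) - \psi_\lambda(t) \right| \le \psi_\lambda(t)\, \pi^2\lambda^2 c_2 c_3 |t - y^*|^3. \]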


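We can now estimate $I_{\lambda,\epsilon}(x^*, y^*)$ as in (3.10) following from the above inequality and (3.11):

\[ \left| I_{\lambda,\epsilon}(x^*, y^*) - \int_{|t - y^*| < \epsilon} [f]_1(\alpha_{j_0}(y^*), y^*)\, \psi_\lambda(t)\,dt \right| \]
\[ \le \left| \int_{|t - y^*| < \epsilon} \left( [f]_1(\alpha_{j_0}(t), t) - [f]_1(\alpha_{j_0}(y^*), y^*) \right) \psi_\lambda(t)\,dt \right| + \left| \int_{|t - y^*| < \epsilon} [f]_1(\alpha_{j_0}(y^*), y^*) \left( \phi_\lambda(\alpha_{j_0}(y^*) - \alpha_{j_0}(t), y^* - t) - \psi_\lambda(t) \right) dt \right| \]
\[ \le c_1 \int_{|t - y^*| < \epsilon} |t - y^*|\, \psi_\lambda(t)\,dt + c_2 c_3 \left| [f]_1(\alpha_{j_0}(y^*), y^*) \right| \int_{|t - y^*| < \epsilon} \psi_\lambda(t)\, \pi^2\lambda^2 |t - y^*|^3\,dt. \]

It is direct to observe from (3.13) that $\psi_\lambda(t) \le \pi\lambda^2 e^{-\pi^2\lambda^2 (t - y^*)^2}$. Moreover, there exists a
positive constant $c_4$ such that $c_2 c_3 |[f]_1(\alpha_{j_0}(y^*), y^*)| \le c_4$ for all $y^* \in [0,1]$. Substituting these
into the above inequality and having a change of variable $u = t - y^*$ yields that

\[ \left| I_{\lambda,\epsilon}(x^*, y^*) - \int_{|t - y^*| < \epsilon} [f]_1(\alpha_{j_0}(y^*), y^*)\, \psi_\lambda(t)\,dt \right| \le 2 c_1 \pi \lambda^2 \int_0^\epsilon u\, e^{-\pi^2\lambda^2 u^2}\,du + 2 c_4 \pi^3 \lambda^4 \int_0^\epsilon u^3 e^{-\pi^2\lambda^2 u^2}\,du. \]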


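Since $e^{-\pi^2\lambda^2 u^2} \le 1$ for all $u \ge 0$, it follows from a direct computation of the above integrals that

\[ \left| I_{\lambda,\epsilon}(x^*, y^*) - \int_{|t - y^*| < \epsilon} [f]_1(\alpha_{j_0}(y^*), y^*)\, \psi_\lambda(t)\,dt \right| \le c_1 \pi (\lambda\epsilon)^2 + \frac{1}{2} c_4 \pi^3 (\lambda\epsilon)^4. \tag{3.14} \]

We next estimate the integral in the above inequality. To this end, we let $F(a) = \int_{-a}^{a} e^{-x^2}\,dx$.
A direct computation from (3.13) gives that

\[ \int_{|t - y^*| < \epsilon} \psi_\lambda(t)\,dt = \frac{\lambda}{\sqrt{1 + (\alpha_{j_0}'(y^*))^2}}\, F\left( \pi\lambda\epsilon \sqrt{1 + (\alpha_{j_0}'(y^*))^2} \right). \]

Note that by using the polar coordinates in the integral, we have the following estimates of
$F(a)$: $\pi(1 - e^{-a^2}) \le F^2(a) \le \pi(1 - e^{-2a^2})$, which implies $|F(a) - \sqrt{\pi}| \le \sqrt{\pi} e^{-a^2}$. Substituting
it into the above equation, we have

\[ \left| \int_{|t - y^*| < \epsilon} \psi_\lambda(t)\,dt - \frac{\sqrt{\pi}\lambda}{\sqrt{1 + (\alpha_{j_0}'(y^*))^2}} \right| \le \frac{\sqrt{\pi}\lambda}{\sqrt{1 + (\alpha_{j_0}'(y^*))^2}}\, e^{-\pi^2\lambda^2\epsilon^2 (1 + (\alpha_{j_0}'(y^*))^2)} \le \sqrt{\pi} \lambda e^{-\pi^2\lambda^2\epsilon^2}. \]

Since $[f]_1$ is continuous, there exists a positive constant $c_5$ such that $|[f]_1(x^*, y^*)| \le c_5$ for all
$y^* \in [0,1]$. It implies

\[ \left| \int_{|t - y^*| < \epsilon} [f]_1(\alpha_{j_0}(y^*), y^*)\, \psi_\lambda(t)\,dt - \frac{\sqrt{\pi}\lambda\, [f]_1(\alpha_{j_0}(y^*), y^*)}{\sqrt{1 + (\alpha_{j_0}'(y^*))^2}} \right| \le c_5 \sqrt{\pi} \lambda e^{-\pi^2\lambda^2\epsilon^2}. \]

The desired result follows from this combined with (3.9) and (3.14). □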
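
The polar-coordinate bounds on F(a) used in the proof can be checked numerically. The sketch below is only a spot check at a few arbitrarily chosen values of a; it relies on the standard identity F(a) = sqrt(pi) * erf(a), and the helper names are hypothetical.

```python
import math

def F(a):
    """F(a) = integral of exp(-x^2) over [-a, a], which equals sqrt(pi) * erf(a)."""
    return math.sqrt(math.pi) * math.erf(a)

def check_bounds(a):
    """Return whether the two-sided bound on F(a)^2 and the bound on |F(a) - sqrt(pi)| hold."""
    lower = math.pi * (1.0 - math.exp(-a * a))
    upper = math.pi * (1.0 - math.exp(-2.0 * a * a))
    two_sided = lower <= F(a) ** 2 <= upper
    gap = abs(F(a) - math.sqrt(math.pi)) <= math.sqrt(math.pi) * math.exp(-a * a)
    return two_sided, gap

results = {a: check_bounds(a) for a in (0.1, 0.5, 1.0, 2.0, 3.0)}
```

The lower and upper bounds correspond to integrating the two-dimensional Gaussian over the disks inscribed in and circumscribed about the square [-a, a]^2, which is the geometric content of the polar-coordinate argument.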

We remark that we could choose appropriate $\lambda$ and $\epsilon$ such that $I_\lambda(x,y)$ will be arbitrarily
small when the point $(x,y)$ is away from the edge curves and it will blow up when the point
$(x,y)$ is on the edge curve. That is, $I_\lambda(x,y)$ behaves like a “sharp mountain” around the edge
curves. We will present the specific choices of $\lambda$ and $\epsilon$ in the later results.
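
A worked special case (added here as an illustration, under simplifying assumptions not made in the text) shows where the peak height comes from. Suppose the edge through $(x^*, y^*)$ is a straight line $\alpha_{j_0}(t) = x^* + m(t - y^*)$ carrying a constant jump $h$, and ignore all other edges. Then the integral in (3.5) can be evaluated exactly:

```latex
I_\lambda(x^*, y^*)
  = h \int_{\mathbb{R}} \pi\lambda^2\, e^{-\pi^2\lambda^2 (1 + m^2)(t - y^*)^2}\,dt
  = h\,\pi\lambda^2 \cdot \frac{\sqrt{\pi}}{\pi\lambda\sqrt{1 + m^2}}
  = \frac{\sqrt{\pi}\,\lambda\,h}{\sqrt{1 + m^2}},
\qquad\text{so}\qquad
\frac{I_\lambda(x^*, y^*)}{\sqrt{\pi}\,\lambda} = \frac{h}{\sqrt{1 + m^2}}.
```

The peak therefore grows linearly in $\lambda$ on the edge, while Proposition 3.3 (i) bounds the value at distance $\epsilon$ from the edge by a multiple of $\lambda^2 e^{-\pi^2\lambda^2\epsilon^2}$, which is the "sharp mountain" behavior described above.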

We are now ready to present the edge detection behavior of $I_{N,\lambda}$.

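Theorem 3.4 (i) When $(x,y)$ is at least $\epsilon$ away from the edges, that is, $\mathrm{dist}((x,y), \Gamma) \ge \epsilon$,
there exists a positive constant $c$ such that

\[ \left| \frac{I_{N,\lambda}(x,y)}{\sqrt{\pi}\lambda} \right| \le c\left( \lambda^2 e^{-\frac{3N^2}{2\lambda^2}} + \lambda e^{-\pi^2\lambda^2\epsilon^2} + \lambda \frac{e^{-\pi^2\lambda^2}}{(1 - e^{-2\lambda^2})^2} + \frac{\|f_x\|_\infty}{\lambda} \right). \]

(ii) When $(x^*, y^*)$ is on the edge, that is, $x^* = \alpha_{j_0}(y^*)$ for some $j_0 \in \mathbb{N}$, if there exists an $\epsilon > 0$
such that $d((\alpha_j(y), y), (\alpha_{j_0}(y^*), y^*)) \ge \epsilon$ for all $j \ne j_0$ and $y \in [0,1]$, then there exists a
positive constant $c$ such that

\[ \left| \frac{I_{N,\lambda}(x^*, y^*)}{\sqrt{\pi}\lambda} - \frac{[f]_1(x^*, y^*)}{\sqrt{1 + (\alpha_{j_0}'(y^*))^2}} \right| \le c\left( \lambda^2 e^{-\frac{3N^2}{2\lambda^2}} + \lambda \frac{e^{-\pi^2\lambda^2}}{(1 - e^{-2\lambda^2})^2} + \frac{\|f_x\|_\infty}{\lambda} + (\lambda + 1) e^{-\pi^2\lambda^2\epsilon^2} + \lambda\epsilon^2 + \lambda^3\epsilon^4 \right). \]

Proof: It follows immediately from Propositions 3.1, 3.2, and 3.3. □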


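Similarly, we could obtain the following estimates on $J_{N,\lambda}$.

Theorem 3.5 (i) When $(x,y)$ is at least $\epsilon$ away from the edges, that is, $\mathrm{dist}((x,y), \Gamma) \ge \epsilon$,
there exists a positive constant $c$ such that

\[ \left| \frac{J_{N,\lambda}(x,y)}{\sqrt{\pi}\lambda} \right| \le c\left( \lambda^2 e^{-\frac{3N^2}{2\lambda^2}} + \lambda e^{-\pi^2\lambda^2\epsilon^2} + \lambda \frac{e^{-\pi^2\lambda^2}}{(1 - e^{-2\lambda^2})^2} + \frac{\|f_y\|_\infty}{\lambda} \right). \]

(ii) When $(x^*, y^*)$ is on the edge, that is, $y^* = a_{l_0}(x^*)$ for some $l_0 \in \mathbb{N}$, if there exists an $\epsilon > 0$
such that $d((x, a_l(x)), (x^*, a_{l_0}(x^*))) \ge \epsilon$ for all $l \ne l_0$ and $x \in [0,1]$, then there exists a
positive constant $c$ such that

\[ \left| \frac{J_{N,\lambda}(x^*, y^*)}{\sqrt{\pi}\lambda} - \frac{[f]_2(x^*, y^*)}{\sqrt{1 + (a_{l_0}'(x^*))^2}} \right| \le c\left( \lambda^2 e^{-\frac{3N^2}{2\lambda^2}} + \lambda \frac{e^{-\pi^2\lambda^2}}{(1 - e^{-2\lambda^2})^2} + \frac{\|f_y\|_\infty}{\lambda} + (\lambda + 1) e^{-\pi^2\lambda^2\epsilon^2} + \lambda\epsilon^2 + \lambda^3\epsilon^4 \right). \]

Proof: It follows immediately from Propositions 3.1, 3.2, and 3.3. □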

Consequently, we will present the main result of this paper below. In particular, we will
show the specific choices of $\lambda$ and $\varepsilon$ such that the edge detector $E_{N,\lambda}$ as in (2.2) behaves like
oscillation-free sharp "mountains" around the edges.

Theorem  3.6  If  A  =  eg  lo^N  and  e  =  C\  ( ) ~p  for  some  positive  constants  Cg,Ci  and  |  < 
p  <  1,  then  for  large  enough  N, 

(i) when $(x,y)$ is at least $\varepsilon$ away from the edges, that is, $\mathrm{dist}((x,y),\Gamma) \ge \varepsilon$, there exists a
positive constant $c$ such that

I  771  /  x,  ^  log  AT 

\EN,x{x,y)\  <  c—^\ 

(ii) when $(x^*,y^*)$ is on the edge, that is, $x^* = a_{j_0}(y^*)$ and $y^* = a_{l_0}(x^*)$ for some $j_0, l_0 \in \mathbb{N}$,
if there exists an $\varepsilon > 0$ such that $d((a_j(y),y),(a_{j_0}(y^*),y^*)) \ge \varepsilon$ for all $j \ne j_0$ and $y \in [0,1]$
and $d((x,a_l(x)),(x^*,a_{l_0}(x^*))) \ge \varepsilon$ for all $l \ne l_0$ and $x \in [0,1]$, then there exists a positive
constant $c$ such that

,  /loglV\4p_3 
\En,\(x* ,y*)  —  [f](x*,y*)\  <  c  • 


Proof: (i) It follows from a direct computation, substituting the choices of $\lambda$ and $\varepsilon$ into
Theorems 3.4, 3.5 and the definition of $E_{N,\lambda}$ in (2.2).

(ii) Note that when $x^* = a_{j_0}(y^*)$ and $y^* = a_{l_0}(x^*)$, we have $[f](x^*,y^*) = [f]_1(x^*,y^*) =
[f]_2(x^*,y^*)$ and

$$\left(\frac{1}{\sqrt{1+(a'_{j_0}(y^*))^2}}\right)^2 + \left(\frac{1}{\sqrt{1+(a'_{l_0}(x^*))^2}}\right)^2 = 1.$$

The desired result follows immediately from a direct computation, substituting the choices of
$\lambda$ and $\varepsilon$ into Theorems 3.4, 3.5 and the definition of $E_{N,\lambda}$ in (2.2). □


4  Numerical  Results 

We now present numerical results demonstrating the accuracy of the proposed formulation.
Matlab code used to generate the figures in this section can be found at [14]. We begin with
Figure 4, where we plot the edge map of a Shepp-Logan phantom on a 256 x 256 grid given its
first 50 x 50 Fourier modes. Low-resolution measurement acquisitions such as this are common
in MR imaging applications. For reference, the partial Fourier sum reconstruction (showing
significant Gibbs oscillations) is plotted in Figure 4a. Applying the proposed spectral mollifier,
the resulting jump function approximation is shown in Figure 4b, while the resulting edge map
is shown in Figure 4c. Hysteresis edge tracking (similar to that implemented in the Canny edge
detector) was used to obtain Figure 4c from Figure 4b. For comparison, we also plot in Figure
4d the results of applying the standard Canny edge detector. Note the presence of a significant
number of false positives (see also Figures 4e and 4f for cross-section plots); these are due to
the Gibbs oscillations being spuriously identified as edges by the Canny algorithm. Finally, we
note that the proposed method also provides approximations to the jump height (as illustrated
in Figure 4e), which may be useful in certain applications such as the solution of PDEs.

Figure 4: Edge Detection, Shepp-Logan Phantom; $S_N = [-50, 50]^2 \cap \mathbb{Z}^2$ while the equispaced
reconstruction grid is of size 256 x 256 in $[0, 1]^2$.

DISTRIBUTION A: Distribution approved for public release.
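The hysteresis edge tracking step mentioned above can be illustrated with a short Python sketch. This is not the Matlab code referenced in [14]; the function name `hysteresis_edges`, the two thresholds `low` and `high`, and the use of 8-connectivity are assumptions made here for illustration only. Pixels whose jump magnitude reaches `high` seed the edge map, and pixels that reach `low` are kept only if they are connected to a seed.

```python
from collections import deque


def hysteresis_edges(jump, low, high):
    """Return a boolean edge map from a 2D list of jump approximations.

    Hypothetical helper: `low` and `high` are illustrative thresholds.
    """
    rows, cols = len(jump), len(jump[0])
    edge = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed with "strong" pixels: |jump| at or above the high threshold.
    for i in range(rows):
        for j in range(cols):
            if abs(jump[i][j]) >= high:
                edge[i][j] = True
                queue.append((i, j))
    # Grow through 8-connected "weak" pixels: |jump| at or above the low threshold.
    while queue:
        i, j = queue.popleft()
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols and not edge[ni][nj]:
                    if abs(jump[ni][nj]) >= low:
                        edge[ni][nj] = True
                        queue.append((ni, nj))
    return edge


# Toy jump map: one strong pixel, one weak neighbour, one isolated weak pixel.
jump = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.9, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.4, 0.0, 0.0, 0.0],
]
edges = hysteresis_edges(jump, low=0.3, high=0.8)
```

In this toy example the weak pixel next to the strong one is retained, while the isolated weak pixel in the bottom-left corner is discarded. Suppressing isolated weak responses in this way is the purpose of the tracking step.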

Next, we present a higher resolution example in Figure 5, where the edges in the Shepp-Logan
phantom are identified starting with the first 200 x 200 Fourier modes. As before, the results are
plotted on a 256 x 256 equispaced grid. Figure 5a plots the Fourier partial sum reconstruction
for reference while Figure 5b plots the jump function approximation. Figures 5c and 5d plot the
edge maps generated by the proposed method and the Canny edge detector respectively, while
Figures 5e and 5f show the corresponding cross-section plots. In this case, Gibbs oscillations in
the Fourier reconstruction are localized to regions close to the true edge locations. Moreover, the
standard Canny edge detector does a good job of recognizing and suppressing spurious Gibbs
oscillations from true edges. However, note that some of the closely spaced edges are either
missing or spuriously identified by the Canny edge detector (see the cross-section plots for an
illustration), while the proposed method accurately identifies these.

Figures 4 and 5 have illustrated the performance of the method when we have perfect
(noiseless) measurements. We now consider the case where the Fourier modes are corrupted by
additive (complex) Gaussian noise; i.e.,

$$\hat{g}(z) = \hat{f}(z) + n(z), \quad z = (z_1, z_2) \in S_N := [-N, N]^2 \cap \mathbb{Z}^2,$$

where $\hat{f}$ and $\hat{g}$ denote the true and noise corrupted Fourier coefficients respectively, and $n$ denotes
additive noise in Fourier space. In Figure 6, the first 50 x 50 Fourier modes of the Shepp-Logan
phantom are corrupted by i.i.d. additive complex Gaussian noise of variance $\sigma^2 = 2 \times 10^{-4}$.
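The equivalent PSNR is

$$\mathrm{PSNR} = 20 \log_{10} \left( \frac{\max_{i,j} |f(x_i, y_j)|}{\sqrt{\dfrac{1}{M_x M_y} \displaystyle\sum_{i=0}^{M_x-1} \sum_{j=0}^{M_y-1} \left[ S_N f(x_i, y_j) - S_N g(x_i, y_j) \right]^2}} \right),$$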

where $M_x$, $M_y$ are the number of points in the reconstruction grid ($M_x = M_y = 256$ in Figure
6) and $S_N f$, $S_N g$ are the Fourier partial sum reconstructions of $f$ and $g$ respectively:

$$S_N f(x,y) = \sum_{z \in S_N} \hat{f}(z) e^{2\pi i (z_1 x + z_2 y)}, \qquad S_N g(x,y) = \sum_{z \in S_N} \hat{g}(z) e^{2\pi i (z_1 x + z_2 y)}.$$
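As a rough illustration of how such a PSNR value can be evaluated on a reconstruction grid, the following Python sketch assumes the common convention of 20 log10 of the peak value over the root-mean-square error between the clean and noisy partial sum reconstructions. The function name `psnr_db` and the tiny 2 x 2 grids are hypothetical and serve only to show the arithmetic; the sketch also assumes the two reconstructions differ, so that the error is nonzero.

```python
import math


def psnr_db(f_grid, clean_recon, noisy_recon):
    """PSNR in dB under the assumed peak-over-RMS-error convention."""
    # Peak magnitude of the reference function on the grid.
    peak = max(abs(v) for row in f_grid for v in row)
    count = sum(len(row) for row in clean_recon)
    # Sum of squared differences between the two reconstructions.
    sq_err = sum(
        abs(a - b) ** 2
        for row_a, row_b in zip(clean_recon, noisy_recon)
        for a, b in zip(row_a, row_b)
    )
    rmse = math.sqrt(sq_err / count)
    return 20.0 * math.log10(peak / rmse)


# Toy grids: the "noisy" reconstruction is offset by 0.1 everywhere.
f_grid = [[0.0, 1.0], [1.0, 0.0]]
noisy = [[v + 0.1 for v in row] for row in f_grid]
value = psnr_db(f_grid, f_grid, noisy)
```

With a peak of 1 and a uniform error of 0.1, the ratio is 10, which corresponds to 20 dB under this convention.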

As before, the jump function approximation, edge maps using the proposed method and the
Canny edge detector, and the cross sections of the edge maps are shown in Figures 6a - 6f
respectively. We observe that the addition of noise to pre-existing Gibbs oscillations results
in the Canny edge detector generating numerous spurious edges, while the proposed method
suppresses almost all of these artifacts and generates a near-perfect edge map.

Figure 5: Edge Detection, Shepp-Logan Phantom; $S_N = [-200, 200]^2 \cap \mathbb{Z}^2$ while the equispaced
reconstruction grid is of size 256 x 256 in $[0, 1]^2$.

Figure 6: Noisy Edge Detection, Shepp-Logan Phantom; $S_N = [-50, 50]^2 \cap \mathbb{Z}^2$ while the
equispaced reconstruction grid is of size 256 x 256 in $[0, 1]^2$. Additive complex white Gaussian noise of
variance $2 \times 10^{-4}$ (PSNR 36.93 dB) was added to the Fourier modes.

5  Concluding  Remarks 

In  this  paper,  we  have  introduced  a  class  of  spectral  mollifiers  for  the  detection  of  edges  from 
two-dimensional  truncated  Fourier  data.  Recall  that  the  problem  of  detecting  edges  from  Fourier 
spectral  data  is  different  from  and  more  challenging  than  the  problem  of  detecting  edges  from 
pixel  data.  Indeed,  distinguishing  between  true  edges  and  Gibbs  oscillations  is  a  non-trivial  task, 
especially  when  we  start  with  a  small  number  of  (possibly  noise  corrupted)  Fourier  coefficients. 
We have shown through rigorous analysis that the jump approximations generated using the
proposed spectral mollifier are guaranteed to be free of spurious oscillations and edges. Numerical
results show that the resulting edge maps are accurate and outperform standard methods
such as the Canny edge detector, especially in cases where we have truncated and/or noisy data.

Several  interesting  avenues  for  future  research  exist,  including  the  extension  of  these  results 
to  the  case  of  non-harmonic  Fourier  data,  investigation  of  the  performance  of  this  method  for 
highly  incomplete  or  interrupted  data,  and  the  extension  of  the  method  to  the  case  of  distributed 
data acquisition.

RANDOM MATRICES AND ERASURE ROBUST FRAMES

YANG  WANG 


Abstract. Data erasure can often occur in communication. Guarding against erasures
involves redundancy in data representation. Mathematically this may be achieved by
redundancy through the use of frames. One way to measure the robustness of a frame
against erasures is to examine the worst case condition number of the frame with a certain
number of vectors erased from the frame. The term numerically erasure-robust frames
(NERFs) was introduced in [9] to give a more precise characterization of erasure robustness
of frames. In that paper the authors established that random frames whose entries are
drawn independently from the standard normal distribution can be robust against up to
approximately 15% erasures, and asked whether there exist frames that are robust against
erasures of more than 50%. In this paper we show that with very high probability random
frames are, independent of the dimension, robust against any amount of erasures as long
as the number of remaining vectors is at least $1 + \delta_0$ times the dimension for some $\delta_0 > 0$.
This is the best possible result, and it also implies that the proportion of erasures can
be arbitrarily close to 1 while still maintaining robustness. Our result depends crucially
on a new estimate for the smallest singular value of a rectangular random matrix with
independent standard normal entries.


1.  Introduction 

Let $H$ be a Hilbert space. A set of elements $\mathcal{F} = \{f_n\}$ in $H$ (counting multiplicity) is
called a frame if there exist two positive constants $C_*$ and $C^*$ such that for any $v \in H$ we
have

(1.1)    $C_* \|v\|^2 \le \sum_n |\langle v, f_n \rangle|^2 \le C^* \|v\|^2.$

The constants $C_*$ and $C^*$ are called the lower frame bound and the upper frame bound,
respectively. A frame is called a tight frame if $C_* = C^*$. In this paper we focus mostly on
real finite dimensional Hilbert spaces with $H = \mathbb{R}^n$ and $\mathcal{F} = \{f_j\}_{j=1}^N$, although we shall
also discuss the extendability of the results to the complex case. Let $F = [f_1, f_2, \dots, f_N]$. It
is called the frame matrix for $\mathcal{F}$. It is well known that $\mathcal{F}$ is a frame if and only if the $n \times N$
matrix $F$ has rank $n$. Furthermore, the optimal frame bounds are given by

$$C_* = \sigma_n^2(F), \qquad C^* = \sigma_1^2(F),$$

where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n > 0$ are the singular values of $F$. Throughout this paper we shall
identify without loss of generality a frame $\mathcal{F}$ by its frame matrix.


The main focus of the paper is on the erasure robustness property for a frame. This
property arises in applications such as communication where data can be lost or corrupted
in the process of transmission. If a frame $F$ is full spark in the
sense that every $n$ columns of $F$ span $\mathbb{R}^n$, then it is theoretically possible to erase up to $N - n$
data from the full set of data $\{\langle v, f_j \rangle\}_{j=1}^N$ while still reconstructing the signal $v$. This is a
simple consequence of the property that with the remaining available data $\{\langle v, f_j \rangle\}_{j \in S}$ with
$|S| \ge n$, $v$ is uniquely determined because $\mathrm{span}(f_j : j \in S) = \mathbb{R}^n$. In practice, however,
the condition number of the matrix $[f_j]_{j \in S}$ could be so poor that the reconstruction is
numerically unstable against the presence of additive noise in the data. Thus robustness
against data loss and erasures is a highly desirable property for a frame. There have been
a number of studies that aim to address this important issue.


One of the first studies of erasure-robust frames was given in [10]. It was shown in
subsequent studies that unit norm tight frames are optimally robust against one erasure
[?] while Grassmannian frames are optimally robust against two erasures [16, 11]. The
literature on erasure robustness for frames is quite extensive; see e.g. also [12, 18, 13]. In
general, the robustness of a frame $F$ against $q$-erasures, where $q \le N - n$, is measured by
the maximum of the condition numbers of all $n \times (N - q)$ submatrices of $F$. More precisely,
let $S \subset \{1, 2, \dots, N\}$ and let $F_S$ denote the $n \times |S|$ submatrix of $F$ with columns $f_j$ for $j \in S$
(in its natural order, although the order of the columns is irrelevant). Then the robustness
against $q$-erasures of $F$ is measured by

(1.2)    $R(F, q) := \max_{|S| = N - q} \dfrac{\sigma_1(F_S)}{\sigma_n(F_S)}.$

Of course, the smaller $R(F, q)$ is the more robust $F$ is against $q$-erasures. In [9], Fickus and
Mixon coined the term numerically erasure robust frame (NERF). A frame $F$ is a $(K, \alpha, \beta)$-NERF
if

$$\alpha \le \sigma_n(F_S) \le \sigma_1(F_S) \le \beta \quad \text{for any } S \subset \{1, 2, \dots, N\},\ |S| = K.$$

Thus in this case $R(F, N - K) \le \beta/\alpha$. Note that for any full spark $n \times N$ frame matrix
$F$ and any $n \le K \le N$ there always exist $\alpha, \beta > 0$ such that $F$ is a $(K, \alpha, \beta)$-NERF.
The main goal is to find classes of frames where the bounds $\alpha$, $\beta$, and more importantly
the resulting bound $\beta/\alpha$ on $R(F, N - K)$, are independent of the dimension $n$ while allowing the
proportion of erasures $1 - \frac{K}{N}$ to be as large as possible. The authors studied in [9] the erasure
robustness of $F = \frac{1}{\sqrt{n}} A$, where the entries of $A$ are independent random variables of the standard normal
$N(0, 1)$ distribution. It was shown that with high probability such matrices can be good
NERFs provided that $K$ is no less than approximately 85% of $N$. The authors also proved
that an equiangular frame $F$ in $\mathbb{C}^n$ with $N = n^2 - n + 1$ vectors is a good NERF against
up to about 50% erasures. As far as the proportion of erasures is concerned this was the
best known result for NERFs. However, the frame requires almost $n^2$ vectors. The authors
posed as an open question whether there exist NERFs with $K < N/2$. A more recent paper
[8] explored a deterministic construction based on certain group theoretic techniques. The
approach offers more flexibility in the frame design than the far more restrictive equiangular
frames.
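To make the quantity $R(F, q)$ in (1.2) concrete, the following Python sketch evaluates it by brute force for a very small frame in $\mathbb{R}^2$. It is an illustration under simplifying assumptions, not a method from the paper: the helper names are hypothetical, the dimension is fixed at $n = 2$ so that the singular values can be read off the $2 \times 2$ Gram matrix in closed form, and the enumeration over all subsets is only feasible for tiny $N$.

```python
import itertools
import math
import random


def singular_values_2xk(cols):
    """Singular values (largest, smallest) of the 2 x k matrix with the given columns."""
    a = sum(x * x for x, _ in cols)
    b = sum(x * y for x, y in cols)
    d = sum(y * y for _, y in cols)
    # Eigenvalues of the Gram matrix [[a, b], [b, d]] are mean +/- rad.
    mean = 0.5 * (a + d)
    rad = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return math.sqrt(mean + rad), math.sqrt(max(0.0, mean - rad))


def erasure_robustness(frame, q):
    """Worst-case condition number over all ways of erasing q columns."""
    keep = len(frame) - q
    worst = 0.0
    for kept in itertools.combinations(frame, keep):
        s1, s2 = singular_values_2xk(kept)
        worst = max(worst, math.inf if s2 == 0.0 else s1 / s2)
    return worst


# A random Gaussian frame with N = 8 vectors in R^2, erasing q = 4 of them.
rng = random.Random(0)
gaussian_frame = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(8)]
worst_case = erasure_robustness(gaussian_frame, 4)
```

The number of subsets grows combinatorially with $N$, which is one reason probabilistic bounds on $\sigma_n(F_S)$ and $\sigma_1(F_S)$ are used instead of direct enumeration.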

In this paper we revisit the robustness of random frames. We provide a much stronger
result for random frames, showing that for any $\delta > 0$, with very high probability, the frame
$F = \frac{1}{\sqrt{n}} A$ is a $((1 + \delta)n, \alpha, \beta)$-NERF where $\alpha, \beta$ depend only on $\delta$ and the aspect ratio $N/n$.
One version of our result is given by the following theorem.

Theorem 1.1. Let $F = \frac{1}{\sqrt{n}} A$ where $A$ is $n \times N$ whose entries are independent Gaussian
random variables of $N(0, 1)$ distribution. Let $\lambda = \frac{N}{n} > 1$. Then for any $0 < \delta_0 < \lambda - 1$ and
$\tau_0 > 0$ there exist $\alpha, \beta > 0$ depending only on $\delta_0$, $\lambda$ and $\tau_0$ such that for any $\delta_0 \le \delta \le \lambda - 1$,
the frame $F$ is a $((1 + \delta)n, \alpha, \beta)$-NERF with probability at least $1 - e^{-\tau_0 n}$.

Later in the paper we shall provide more explicit estimates for $\alpha, \beta$ that will allow us to
easily compute them numerically. Note that our result is essentially the best possible, as
we cannot go to $\delta_0 = 0$. A corollary of the theorem is that for random Gaussian frames the
proportion of erasures $1 - \frac{K}{N}$ can be made arbitrarily close to 1 while the frames still maintain
robustness with overwhelming probability.

Our theorem depends crucially on a refined estimate on the smallest singular value of a
random Gaussian matrix. There is a wealth of literature on random matrices. The study
of singular values of random matrices has been particularly intense in recent years due
to their applications in compressive sensing for the construction of matrices with the
so-called restricted isometry property (see e.g. [4, 5, 1, 2]). Random matrices have also been
employed for phase retrieval [3], which aims to reconstruct a signal from the magnitudes of
its samples. For a very informative and comprehensive survey of the subject we refer the
readers to [15, 19], which also contain an extensive list of references (among the notable
ones [7, 14, 17]). For the $n \times N$ Gaussian random matrix $A$ the expected values of $\sigma_1(A)$ and
$\sigma_n(A)$ are asymptotically $\sqrt{N} + \sqrt{n}$ and $\sqrt{N} - \sqrt{n}$, respectively. Many important results,
such as the NERF analysis of random matrices in [9] as well as results on the restricted
isometry property in compressive sensing, often utilize known estimates of $\sigma_1(A)$ and $\sigma_n(A)$
based on Hoeffding-type inequalities. One good such estimate is

(1.3)    $P\left(\sigma_n(A) \le \sqrt{N} - \sqrt{n} - t\right) \le e^{-t^2/2},$

see [19]. The problem with this estimate is that even by taking $t = \sqrt{N} - \sqrt{n}$ we only get a
bound of $e^{-(\sqrt{\lambda} - 1)^2 n/2}$ even though the probability in this case is 0. Thus estimates such as
(1.3) that cap the decay rate are often inadequate. When applied to the erasure robustness
problem for frames they usually put a cap on the proportion of erasures. To go further we
must prove an estimate that will allow the exponent of decay to be much larger. We achieve
this goal by proving the following theorem:


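Theorem 1.2. Let $A$ be $n \times N$ whose entries are independent random variables of standard
normal $N(0, 1)$ distribution. Let $\lambda = \frac{N}{n} > 1$. Then for any $\mu > 0$ there exist constants
$c, C > 0$ depending only on $\mu$ and $\lambda$ such that

(1.4)    $P\left(c\sqrt{n} \le \sigma_n(A) \le \sigma_1(A) \le C\sqrt{n}\right) \ge 1 - 3e^{-\mu n}.$

Furthermore, we may take $C = 1 + \sqrt{\lambda} + \sqrt{2\mu}$ and $c = \sup_{0<t<1} \varphi(t)$ where

(1.5)    $\varphi(t) = \dfrac{t^{1/\lambda}}{L} - \dfrac{2Ct}{1-t}, \quad \text{where } L = \sqrt{\dfrac{2e}{\lambda}}\, e^{\mu/\lambda}.$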


Acknowledgement.  The  author  would  like  to  thank  Radu  Balan  and  Dustin  Mixon  for 
very  helpful  discussions. 


2.  Smallest  Singular  Value  of  a  Random  Matrix:  Nonasymptotic  Estimate 

We begin with estimates on the extremal singular values of a random matrix $A$ whose
entries are independent standard normal random variables. We shall assume throughout
the section that $A$ is $n \times N$ where $\frac{N}{n} = \lambda > 1$. One of the very important estimates is

(2.1)    $P\left(\sigma_1(A) \ge \sqrt{N} + \sqrt{n} + t\right) \le e^{-t^2/2},$

see [19]. Our main goal of this section is to prove the estimates for the smallest singular value
$\sigma_n(A)$ stated in Theorem 1.2. An equivalent formulation of (2.1) is

(2.2)    $P\left(\sigma_1(A) \ge C\sqrt{n}\right) \le e^{-\frac{(C - 1 - \sqrt{\lambda})^2}{2} n}, \qquad C \ge 1 + \sqrt{\lambda}.$

Observe that

$$\sigma_n(A) = \min_{v \in S^{n-1}} \|A^* v\|,$$

where $S^{n-1}$ denotes the unit sphere in $\mathbb{R}^n$.

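Lemma 2.1. Let $c > 0$. For any $v \in S^{n-1}$ the probability $P(\|A^* v\| \le c)$ is independent of
the choice of $v$. We have

(2.3)    $P\left(\|A^* v\| \le \sqrt{\delta n}\right) \le \left(\dfrac{2e\delta}{\lambda}\right)^{N/2}$

for any $\delta > 0$.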
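A quick Monte Carlo experiment can make the lemma tangible. The Python sketch below assumes that the bound in (2.3) reads $(2e\delta/\lambda)^{N/2}$, which is the form used later in the proof of Theorem 1.2. The function names, the sample sizes and the parameter values are illustrative choices, and the bound is only informative when $2e\delta/\lambda < 1$.

```python
import math
import random


def lemma_bound(n, N, delta):
    """Right-hand side of the assumed bound (2 e delta / lambda)^(N/2)."""
    lam = N / n
    return (2.0 * math.e * delta / lam) ** (N / 2.0)


def empirical_probability(n, N, delta, trials=20000, seed=0):
    """Estimate P(||A* v|| <= sqrt(delta n)) for v = e_1 by simulation."""
    rng = random.Random(seed)
    threshold = delta * n
    hits = 0
    for _ in range(trials):
        # With v = e_1, ||A* v||^2 is the squared norm of the first row of A.
        row_sq = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(N))
        if row_sq <= threshold:
            hits += 1
    return hits / trials


n, N, delta = 2, 6, 0.3
estimate = empirical_probability(n, N, delta)
bound = lemma_bound(n, N, delta)
```

Because the probability does not depend on the unit vector $v$, simulating a single row of $A$ is enough; the full matrix is never formed.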


Proof. The fact that $P(\|A^* v\| \le c)$ is independent of the choice of $v$ is well known;
it stems from the fact that the entries of $PA$ are again independent standard normal
random variables for any orthogonal $n \times n$ matrix $P$. In particular, one can always find an
orthogonal $P$ such that $Pv = e_1$. Thus we may without loss of generality take $v = e_1$. In
this case $\|A^* v\|^2 = a_{11}^2 + \cdots + a_{1N}^2$ where $[a_{11}, \dots, a_{1N}]$ denotes the first row of $A$. Denote
$Y_N = a_{11}^2 + \cdots + a_{1N}^2$. Then $Y_N$ has the $\Gamma(\frac{N}{2}, 1)$ distribution, which has the density function

$$p(t) = \frac{1}{\Gamma(N/2)}\, e^{-t} t^{\frac{N}{2} - 1}, \quad t > 0.$$

Denote $m = \frac{N}{2}$. It follows that

$$P\left(\|A^* v\| \le \sqrt{\delta n}\right) = P(Y_N \le \delta n) = \frac{1}{\Gamma(m)} \int_0^{\delta n} e^{-t} t^{m-1}\, dt \le \frac{1}{\Gamma(m)} \int_0^{\delta n} t^{m-1}\, dt = \frac{\delta^m n^m}{m\,\Gamma(m)}.$$

Note that $m\,\Gamma(m) = \Gamma(m+1) \ge \left(\frac{m}{e}\right)^m$ by Stirling's formula. The lemma now follows from $\frac{N}{n} = \lambda$ and
$m = \frac{N}{2}$.

A ubiquitous tool in the study of random matrices is an $\varepsilon$-net. For any $\varepsilon > 0$ an $\varepsilon$-net
for $S^{n-1}$ is a set in $S^{n-1}$ such that any point on $S^{n-1}$ is no more than $\varepsilon$ distance away from
the set. The following result is known and can be found in [19]:

Lemma 2.2. For any $\varepsilon > 0$ there exists an $\varepsilon$-net $\mathcal{N}_\varepsilon$ in $S^{n-1}$ with cardinality no larger
than $(1 + 2\varepsilon^{-1})^n$.
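The cardinality bound can be checked on a toy case. The Python sketch below builds a net greedily on the unit circle, that is, $S^{n-1}$ with $n = 2$, by keeping a sample point only if it lies more than $\varepsilon$ away from every point kept so far. The function name, the sampling resolution and the greedy construction are assumptions made for illustration; they are not the argument in [19], and the net is only verified against the finite sample rather than the whole circle.

```python
import math


def greedy_net_on_circle(eps, samples=2000):
    """Greedy eps-separated subset of equispaced points on the unit circle."""
    points = [
        (math.cos(2.0 * math.pi * k / samples), math.sin(2.0 * math.pi * k / samples))
        for k in range(samples)
    ]
    net = []
    for p in points:
        # Keep p only if it is farther than eps from every net point so far.
        if all(math.dist(p, q) > eps for q in net):
            net.append(p)
    return points, net


eps = 0.5
points, net = greedy_net_on_circle(eps)
cardinality_bound = (1.0 + 2.0 / eps) ** 2  # (1 + 2/eps)^n with n = 2
# Every sample point is either in the net or within eps of a net point.
covered = all(min(math.dist(p, q) for q in net) <= eps for p in points)
```

A maximal $\varepsilon$-separated set is automatically an $\varepsilon$-net, which is the idea behind the standard volume argument for bounds of this type.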


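Proof of Theorem 1.2. Assume that $\sigma_n(A) = b\sqrt{n}$. Then there exists a $v_0 \in S^{n-1}$ such
that $\|A^* v_0\| = b\sqrt{n}$. Let $\mathcal{N}_\varepsilon$ be an $\varepsilon$-net for $S^{n-1}$ and take $u \in \mathcal{N}_\varepsilon$ that is the closest to $v_0$.
So $\|u - v_0\| \le \varepsilon$. Thus

(2.4)    $\|A^* u\| \le \|A^* v_0\| + \|A^*(u - v_0)\| \le b\sqrt{n} + \varepsilon \sigma_1(A).$

Hence

(2.5)    $P\left(\sigma_n(A) \le c\sqrt{n}\right) \le \sum_{u \in \mathcal{N}_\varepsilon} P\left(\|A^* u\| \le c\sqrt{n} + \varepsilon \sigma_1(A)\right).$

Note that

$$P\left(\|A^* u\| \le c\sqrt{n} + \varepsilon\sigma_1(A)\right) = P\left(\|A^* u\| \le c\sqrt{n} + \varepsilon\sigma_1(A),\ \sigma_1(A) \le C\sqrt{n}\right) + P\left(\|A^* u\| \le c\sqrt{n} + \varepsilon\sigma_1(A),\ \sigma_1(A) > C\sqrt{n}\right).$$

By Lemma 2.1 the first term on the right hand side is bounded from above by

$$P\left(\|A^* u\| \le c\sqrt{n} + \varepsilon\sigma_1(A),\ \sigma_1(A) \le C\sqrt{n}\right) \le P\left(\|A^* u\| \le c\sqrt{n} + \varepsilon C\sqrt{n}\right) \le \left(\frac{2e(c + \varepsilon C)^2}{\lambda}\right)^{N/2}.$$

By (2.2) the second term on the right hand side is bounded from above by

$$P\left(\|A^* u\| \le c\sqrt{n} + \varepsilon\sigma_1(A),\ \sigma_1(A) > C\sqrt{n}\right) \le P\left(\sigma_1(A) > C\sqrt{n}\right) \le e^{-\frac{(C - 1 - \sqrt{\lambda})^2}{2} n}.$$

Thus combining these two upper bounds we obtain the estimate

(2.6)    $P\left(\sigma_n(A) \le c\sqrt{n}\right) \le \left(1 + \dfrac{2}{\varepsilon}\right)^n \left(\dfrac{2e(c + \varepsilon C)^2}{\lambda}\right)^{N/2} + e^{-\frac{(C - 1 - \sqrt{\lambda})^2}{2} n}.$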


We would like to bound $P(\sigma_n(A) \le c\sqrt{n})$ by $2e^{-\mu n}$. All we need then is to choose
$\varepsilon, c, C > 0$ so that both upper bound terms in (2.6) are bounded by $e^{-\mu n}$. Note that
$\frac{N}{2} = \frac{\lambda}{2} n$. Hence we only need

(2.7)    $-\mu \ge \ln(1 + 2\varepsilon^{-1}) + \dfrac{\lambda}{2}\left(\ln 2e - \ln\lambda + 2\ln(c + \varepsilon C)\right),$

(2.8)    $-\mu \ge -\dfrac{1}{2}\left(C - 1 - \sqrt{\lambda}\right)^2.$

The inequality (2.8) leads to the condition

(2.9)    $C \ge \sqrt{2\mu} + \sqrt{\lambda} + 1.$

To meet condition (2.7) we set $c = r\varepsilon$. Then $\ln(c + \varepsilon C) = -\ln\varepsilon^{-1} + \ln(r + C)$. Thus (2.7)
becomes

(2.10)    $(\lambda - 1)\ln(\varepsilon^{-1}) \ge \mu + \ln(2 + \varepsilon) + \dfrac{\lambda}{2}\ln\left(\dfrac{2e(r + C)^2}{\lambda}\right).$

Clearly, once we fix $C$ and $r$, say, take $C = \sqrt{2\mu} + \sqrt{\lambda} + 1$ and $r = 1$, $(\lambda - 1)\ln\varepsilon^{-1}$ will be greater
than the right hand side of (2.10) for small enough $\varepsilon$ because of the condition $\lambda > 1$. Both
$C$, $c$ only depend on $\lambda$ and $\mu$. The existence part of the theorem is thus proved.

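While we already have a good explicit estimate $C = \sqrt{2\mu} + \sqrt{\lambda} + 1$, it remains to establish
the explicit formula for $c$. For any fixed $r$ the largest $\varepsilon$ is achieved when (2.10) is an equality,
namely

$$(\lambda - 1)\ln(\varepsilon^{-1}) = \mu + \ln(2 + \varepsilon) + \frac{\lambda}{2}\ln\left(\frac{2e(r + C)^2}{\lambda}\right),$$

which one can rewrite as

$$\ln(r + C) = -(1 - \rho)\ln\varepsilon - \rho\ln(2 + \varepsilon) - \ln L,$$

where $\rho = \lambda^{-1}$ and $L = \sqrt{\frac{2e}{\lambda}}\, e^{\mu/\lambda}$. It follows that

$$r\varepsilon = \frac{1}{L}\left(\frac{\varepsilon}{2 + \varepsilon}\right)^{\rho} - C\varepsilon = \frac{t^{1/\lambda}}{L} - \frac{2Ct}{1 - t},$$

where $t = \frac{\varepsilon}{2 + \varepsilon}$. Note that $0 < t < 1$. Now we can take $c$ to be the supremum of $r\varepsilon$,
which yields

(2.11)    $c = \sup_{0 < t < 1}\left(\dfrac{t^{1/\lambda}}{L} - \dfrac{2Ct}{1 - t}\right).$

Finally, (1.4) follows from $P(\sigma_n(A) \le c\sqrt{n}) \le 2e^{-\mu n}$ and (2.2). The proof of the theorem
is now complete. ∎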


Remark. Although there does not seem to exist an explicit formula for $c$ given in (2.11),
there is a very good explicit approximation of it. In general, the $t$ that maximizes $\varphi(t)$ is
rather small. So we may approximate $\frac{2Ct}{1-t}$ simply by $2Ct$ and find the maximum of

(2.12)    $\tilde{\varphi}(t) = \dfrac{t^{1/\lambda}}{L} - 2Ct.$

The maximum of $\tilde{\varphi}(t)$ is obtained at $t_0 = (2C\lambda L)^{-\frac{\lambda}{\lambda - 1}}$. This $t_0$ is very close to the actual
$t$ that maximizes $\varphi(t)$. Thus

(2.13)    $\tilde{c} := \varphi(t_0) = \dfrac{t_0^{1/\lambda}}{L} - \dfrac{2Ct_0}{1 - t_0}$

has $\tilde{c} \le c$ and it is a close approximation of the optimal $c$. Of course, Theorem 1.2 still
holds when $c$ is replaced by $\tilde{c}$.
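The constants in Theorem 1.2 are straightforward to evaluate numerically. The Python sketch below assumes the forms $C = 1 + \sqrt{\lambda} + \sqrt{2\mu}$, $L = \sqrt{2e/\lambda}\, e^{\mu/\lambda}$ and $\varphi(t) = t^{1/\lambda}/L - 2Ct/(1-t)$, together with the approximate maximizer $t_0 = (2C\lambda L)^{-\lambda/(\lambda-1)}$ from the remark. The function names and the log-spaced grid search around $t_0$ are illustrative choices, so the value returned for $c$ is a lower estimate of the supremum rather than its exact value.

```python
import math


def upper_constant(lam, mu):
    """C = 1 + sqrt(lambda) + sqrt(2 mu)."""
    return 1.0 + math.sqrt(lam) + math.sqrt(2.0 * mu)


def scale_L(lam, mu):
    """L = sqrt(2e / lambda) * exp(mu / lambda)."""
    return math.sqrt(2.0 * math.e / lam) * math.exp(mu / lam)


def phi(t, lam, mu):
    """phi(t) = t^(1/lambda) / L - 2 C t / (1 - t) for 0 < t < 1."""
    C = upper_constant(lam, mu)
    return t ** (1.0 / lam) / scale_L(lam, mu) - 2.0 * C * t / (1.0 - t)


def approximate_maximizer(lam, mu):
    """t_0 = (2 C lambda L)^(-lambda / (lambda - 1))."""
    C = upper_constant(lam, mu)
    return (2.0 * C * lam * scale_L(lam, mu)) ** (-lam / (lam - 1.0))


def lower_constants(lam, mu, grid=4000):
    """Return (c_grid, c_tilde): a grid estimate of sup phi and phi(t_0)."""
    t0 = approximate_maximizer(lam, mu)
    c_tilde = phi(t0, lam, mu)
    c_grid = c_tilde
    # Search a log-spaced grid from t0 / 10 to 10 * t0, staying inside (0, 1).
    for k in range(grid + 1):
        t = t0 * 10.0 ** (-1.0 + 2.0 * k / grid)
        if 0.0 < t < 1.0:
            c_grid = max(c_grid, phi(t, lam, mu))
    return c_grid, c_tilde


lam, mu = 2.0, 1.0
C = upper_constant(lam, mu)
c, c_tilde = lower_constants(lam, mu)
condition_bound = C / c  # bound on sigma_1(A) / sigma_n(A)
```

Any value of $c$ below the supremum keeps the theorem valid, so a coarse grid only costs sharpness, not correctness.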

Although Theorem 1.2 is for real Gaussian random matrices, a complex version of it can
also be proved with minor modifications. A complex random variable $Z = X + iY$ has
the complex standard normal distribution if $X$ and $Y$ are independent and both have the real
standard normal distribution $N(0, 1)$. Theorem 1.2 extends to the following theorem for the complex case:

Theorem 2.3. Let $A$ be $n \times N$ whose entries are independent random variables of complex
standard normal $N(0, 1)$ distribution. Let $\lambda = \frac{N}{n} > 1$. Then for any $\mu > 0$ there exist
constants $c, C > 0$ depending only on $\mu$ and $\lambda$ such that

(2.14)    $P\left(c\sqrt{n} \le \sigma_n(A) \le \sigma_1(A) \le C\sqrt{n}\right) \ge 1 - 3e^{-\mu n}.$

Furthermore, we may take $C = \sqrt{2} + 2\sqrt{\lambda} + 2\sqrt{\mu}$ and $c = \sup_{0<t<1} \varphi(t)$ where

(2.15)    $\varphi(t) = \dfrac{t^{1/\lambda}}{L} - \dfrac{2Ct}{1 - t}, \quad \text{where } L = \sqrt{\dfrac{2e}{\lambda}}\, e^{\frac{\mu}{2\lambda}}.$

Proof. The proof follows the same argument as in the real case so we only sketch the proof
here. In particular we point out the places where the estimates need to be modified.

Write $A = A_R + iA_I$ and set $B = [A_R, A_I]$. Then $B$ is an $n \times 2N$ matrix whose
entries are independent real standard normal random variables. It is easy to check that
$\sigma_1(A) \le \sqrt{2}\,\sigma_1(B)$. Thus by taking $C = 2\sqrt{\lambda} + \sqrt{2} + 2\sqrt{\mu}$ we have via (2.1) that

(2.16)    $P\left(\sigma_1(A) \ge C\sqrt{n}\right) \le e^{-\mu n}.$

The estimate for $\sigma_n(A)$ follows from the same strategy as in the real case. First of all,
just like the real case, for any $n \times n$ unitary matrix $U$ the entries of $UA$ are still independent
complex standard normal random variables. As a result the probability $P(\|A^* v\| \le \sqrt{\delta n})$
where $v \in \mathbb{C}^n$ is a unit vector does not depend on the choice of $v$. By taking $v = e_1$ we see
that $\|A^* v\|^2$ has the $\Gamma(N, 1)$ distribution (as opposed to the $\Gamma(\frac{N}{2}, 1)$ distribution in the
real case). Applying Lemma 2.1 we obtain the equivalent result for the complex case in

(2.17)    $P\left(\|A^* v\| \le \sqrt{\delta n}\right) \le \left(\dfrac{2e\delta}{\lambda}\right)^{N}.$

Next for the $\varepsilon$-net, we observe that the unit sphere in $\mathbb{C}^n$ is precisely the unit sphere in
$\mathbb{R}^{2n}$ if we identify $\mathbb{C}^n$ as $\mathbb{R}^{2n}$. Thus we can find an $\varepsilon$-net $\mathcal{N}_\varepsilon$ of cardinality no more than
$(1 + 2\varepsilon^{-1})^{2n}$. The proof of Theorem 1.2 now goes through with some minor modifications.
The most important one is that with (2.16) and (2.17) the inequality condition (2.7) now
becomes

$$-\frac{\mu}{2} \ge \ln(1 + 2\varepsilon^{-1}) + \frac{\lambda}{2}\left(\ln 2e - \ln\lambda + 2\ln(c + \varepsilon C)\right),$$

where the constant $C$ is changed to $C = 2\sqrt{\lambda} + \sqrt{2} + 2\sqrt{\mu}$. Substituting this $C$ and $\frac{\mu}{2}$ for
$\mu$ we prove the theorem. ∎


3.  Random  Frames  as  NERFs 

Our goal in this section is to establish the robustness of random frames against erasures
by proving Theorem 1.1. Here we restate Theorem 1.1 in a different form for the benefit
of simpler notation in the proof.

Theorem 3.1. Let $F = \frac{1}{\sqrt{n}} A$ where $A$ is $n \times N$ whose entries are drawn independently
from the standard normal $N(0, 1)$ distribution. Let $\lambda = \frac{N}{n} > 1$ and $K = pN = p\lambda n$ where
$\lambda^{-1} < p \le 1$. For any $\tau_0 > 0$ there exist constants $\alpha, \beta > 0$ depending only on $\lambda$, $p$ and $\tau_0$
such that $F$ is a $(K, \alpha, \beta)$-NERF with probability at least $1 - 3e^{-\tau_0 n}$.

Proof. There exist exactly $\frac{N!}{K!(N-K)!}$ subsets $S \subset \{1, 2, \dots, N\}$ of cardinality $|S| = K$. It
is well known that

$$\frac{N!}{K!(N-K)!} \le \frac{N^N}{K^K (N-K)^{N-K}},$$

which can be shown easily by Stirling's Formula or induction on $N$. Set $s_p = p\ln p^{-1} + (1 -
p)\ln(1 - p)^{-1}$, which has $0 < s_p \le \ln 2$. We have then

(3.1)    $\dfrac{N!}{K!(N-K)!} \le \left(p^{-p}(1 - p)^{-(1-p)}\right)^N = e^{s_p N} = e^{\lambda s_p n}.$
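The counting bound can be checked numerically for a few sample sizes. The Python sketch below compares $\ln \binom{N}{K}$ with $N s_p$, where $p = K/N$ and $s_p = p\ln p^{-1} + (1-p)\ln(1-p)^{-1}$; the function names and the particular pairs $(N, K)$ are illustrative choices.

```python
import math


def entropy_exponent(p):
    """s_p = p ln(1/p) + (1 - p) ln(1/(1 - p)), with the limit value 0 at p = 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log(1.0 / p) + (1.0 - p) * math.log(1.0 / (1.0 - p))


def log_subset_count(N, K):
    """Natural logarithm of the number of K-element subsets of {1, ..., N}."""
    return math.log(math.comb(N, K))


# Each entry holds (N, K, ln C(N, K), N * s_p) for p = K / N.
checks = []
for N, K in [(20, 10), (100, 10), (400, 40), (1000, 500)]:
    p = K / N
    checks.append((N, K, log_subset_count(N, K), N * entropy_exponent(p)))
```

In every row the third entry should not exceed the fourth. The gap between them grows only logarithmically in $N$, so the exponent $\lambda s_p n$ used in the union bound is essentially sharp.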

Now we set $\mu := \lambda s_p + \tau_0$. Let $C = \sqrt{2\mu} + \sqrt{p\lambda} + 1$ and $c = \sup_{0<t<1} \varphi(t)$ where $\varphi(t)$ is
given in (1.5). Let the columns of $A$ be $\{a_j\}_{j=1}^N$. For any $S \subset \{1, 2, \dots, N\}$ we denote by
$A_S$ the submatrix of $A$ whose columns are $\{a_j : j \in S\}$. Then for $|S| = K = p\lambda n$ we have

$$P\left(c\sqrt{n} \le \sigma_n(A_S) \le \sigma_1(A_S) \le C\sqrt{n}\right) \ge 1 - 3e^{-\mu n}$$

by Theorem 1.2. It follows that

$$P\left(\sigma_n(A_S) < c\sqrt{n} \text{ or } \sigma_1(A_S) > C\sqrt{n} \text{ for some } S \text{ with } |S| = K\right) \le \sum_{|S| = K} P\left(\sigma_n(A_S) < c\sqrt{n} \text{ or } \sigma_1(A_S) > C\sqrt{n}\right) \le 3e^{(\lambda s_p - \mu)n} = 3e^{-\tau_0 n}.$$

It follows that

$$P\left(c\sqrt{n} \le \sigma_n(A_S) \le \sigma_1(A_S) \le C\sqrt{n} \text{ for all } S \text{ with } |S| = K\right) \ge 1 - 3e^{-\tau_0 n}.$$

This implies that, by setting $\alpha = c$ and $\beta = C$, $F = \frac{1}{\sqrt{n}} A$ is a $(K, \alpha, \beta)$-NERF with
probability at least $1 - 3e^{-\tau_0 n}$. ∎

Theorems 1.1 and 3.1 state that random Gaussian frames can be robust with overwhelming
probability against erasures of an arbitrary proportion of data from the original data,
at least in theory, as long as the number of remaining vectors is at least $(1 + \delta_0)n$ for some
$\delta_0 > 0$. In practice one may ask how good the condition numbers are if the erasures reach
a high proportion, say, 90% of the data. We show some numerical results below.

Example 1. Let $F = \frac{1}{\sqrt{n}} A$ where $A$ is $n \times N$ whose entries are independent standard
normal random variables. Set $\tau_0 = 0.25$. In this experiment we fix $K = 2n$ and $K = 5n$,
respectively, and let $N$ vary. As $N$ increases from $N = K$ to $N = 100K$ the proportion of
erasures $s = 1 - \frac{K}{N}$ increases from 0 to 99%. We shall use $\beta/\alpha$ as a measure of robustness
since it is an upper bound for the condition number. Clearly, as $s$ increases we should expect
$\beta/\alpha$ to increase. The left plot in Figure 1 shows $\log_2(\beta/\alpha)$ against $s$ for both $K = 2n$ (top
curve) and $K = 5n$ (bottom curve). Because the frame is normalized so that each column
is on average a unit norm vector, it also makes sense to use the smallest singular value as
a measurement of robustness. The right plot in Figure 1 shows $-\log_2(\alpha)$ against $s$, also
for both $K = 2n$ (top curve) and $K = 5n$ (bottom curve). Our numerical results show
that in the case $K = 2n$, with probability at least $1 - 3e^{-0.5n}$, the condition number is no
more than 10232 for 50% erasures and no more than 611675 for 90% erasures. In the case
$K = 5n$, the corresponding numbers are 139.88 and 1862.1, respectively. In fact, even with
99% erasures the condition number is no more than 42716.

Figure 1. Left: $\log_2(\beta/\alpha)$ against the proportion of erasures when $N$ varies
from $K$ to $100K$ while $K$ is fixed at $K = 2n$ (top curve) and $K = 5n$ (bottom
curve). Right: Same as in the left figure, but for $-\log_2(\alpha)$.
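An experiment of this kind can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the bounds $\alpha$ and $\beta$ range over all $K$-column submatrices, which cannot be enumerated in practice, so the hypothetical helper `sampled_condition_numbers` samples random submatrices and reports the ratio of the largest to the smallest singular value. Sampling can only give a lower estimate of the worst-case ratio; the function name, the number of trials, and the dimensions are illustrative and are not taken from the experiments reported above.

```python
import numpy as np

def sampled_condition_numbers(n, K, N, trials=200, seed=0):
    """Condition numbers of randomly sampled n x K column submatrices of F = A / sqrt(n)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, N))        # independent standard normal entries
    F = A / np.sqrt(n)                     # each column has unit norm on average
    conds = []
    for _ in range(trials):
        kept = rng.choice(N, size=K, replace=False)   # the K vectors that survive the erasures
        s = np.linalg.svd(F[:, kept], compute_uv=False)
        conds.append(s[0] / s[-1])         # largest / smallest singular value
    return np.array(conds)

n = 20
for K, N in [(2 * n, 4 * n), (2 * n, 20 * n), (5 * n, 10 * n), (5 * n, 50 * n)]:
    c = sampled_condition_numbers(n, K, N)
    print(f"K={K}, N={N}, erasures={1 - K / N:.0%}, worst sampled ratio={c.max():.2f}")
```

Because only a sample of submatrices is inspected, the printed values should be read as optimistic estimates and not as the bounds quoted in the example.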

Example 2. Again we let $F = \frac{1}{\sqrt{n}} A$ where $A$ is $n \times N$ whose entries are independent
standard normal random variables, and let $\tau_0 = 0.25$. In this experiment we fix $N = 200n$
and $N = 50n$, respectively, and let $K$ vary so the proportion of erasures $s = 1 - \frac{K}{N}$ varies
from 0 to 99% ($N = 200n$) and 0 to 97% ($N = 50n$), respectively. Again we should expect
the robustness to go down as we increase $s$. The left plot in Figure 2 shows $\log_2(\beta/\alpha)$ against
$s$ for $N = 50n$ (top curve) and $N = 200n$ (bottom curve). The right plot in Figure 2 shows
$-\log_2(\alpha)$ against $s$, also for both $N = 50n$ (top curve) and $N = 200n$ (bottom curve). Our
numerical results show that in the case $N = 50n$, with probability at least $1 - 3e^{-0.5n}$, the
condition number is no more than 31.7 for 50% erasures and 1862.1 for 90% erasures. In
the case $N = 200n$, the corresponding numbers are 23.48 and 315.12, respectively. Even
with 95% erasures the condition number is no more than 1312.4.

ON THE DECAY OF THE SMALLEST SINGULAR VALUE OF
SUBMATRICES OF RECTANGULAR MATRICES


YANG  LIU  AND  YANG  WANG 


Abstract. In this article, we study the decay of the smallest singular value
of submatrices that consist of bounded column vectors. We find that the
smallest singular value of submatrices is related to the minimal distance of
points to the lines connecting other two points in a bounded point set. Using
a technique from integral geometry and from the perspective of combinatorial
geometry, we show the decay rate of the minimal distance for the sets of points
if the number of the points that are not on the boundary of the convex hull of any
subset is not too large, relative to the cardinality of the set. On the
computational side, we conduct some numerical experiments for many sets
of points and analyze the smallest distance for some extremal configurations.


1.  Introduction 

In  recent  decades,  measurements,  frames,  and  dictionaries  (see  for  instance,  [2], 
[24],  and  [5]),  all  of  which  are  essentially  matrices,  have  been  studied  and  used  in 
signal  processing,  such  as  compressed  sensing,  matrix  recovery,  phase  retrieval,  and 
other  fields.  As  the  main  characteristics  of  a  matrix  or  linear  transformation,  the 
singular  values  and  their  generalized  forms  have  been  studied  in,  for  instance,  [20], 
[18],  [9],  [28],  and  [23].  It  is  not  hard  to  see  that  the  singular  values  of  a  matrix 
are  determined  by  both  the  magnitudes  and  the  angles  of  the  row  vectors  of  the 
matrix. 

Rectangular matrices are of the main interest in some recent research (see, for
instance, [28] and [29]). Here we call a rectangular matrix a slim matrix if there are
more rows than columns in the matrix. Considering the columns of a slim matrix
as points in a bounded region in a plane, we show that the matrix problem can
be reduced to a combinatorial problem. If the magnitudes of all the rows of
a rectangular matrix are bounded, we can estimate the smallest singular values of
submatrices in terms of the size of the matrix, because there are configurations of
matrices whose minimal smallest singular values decay at the order of a power of the size
with some negative exponent. Some estimates on the distances among points in a
set, or the distances from points to lines that connect two other points in a set of
points in a bounded region, are established in this article, and the decay rate of these
distances, in some sense, essentially determines the decay rate of the smallest
singular values of submatrices with bounded column vectors. The combinatorial
geometry problem is related to Heilbronn’s triangle problem (see, for instance,
[16] and [4]). There has been some work on developing algorithms to find
counterexamples for Heilbronn’s original conjecture, but there does not appear to be any
implementable algorithm for finding explicit or concrete sets of points, and
it would be interesting to see the optimal arrangements of $n$ points in a square or
unit disk for Heilbronn’s triangle problem and this problem, respectively. However,
we formulate a conjecture for a slower decay rate, which, as far as we know, is still
open.

The  main  contribution  of  this  paper  is  to  show  the  connection  between  the 
singular  value  problem  and  a  combinatorial  geometry  problem.  Using  a  technique 
from  integral  geometry  and  from  the  perspective  of  combinatorial  geometry,  we 
show  the  decay  rate  of  the  minimal  distance  for  the  sets  of  points  if  the  number 
of  the  points  that  are  not  on  the  boundary  of  the  convex  hull  of  any  subset  is 
not too large, relative to the cardinality of the set. We also obtain some other
results regarding this combinatorial geometry problem in some cases, and hence for
the minimal smallest singular value of submatrices of rectangular matrices.

This paper is structured as follows: in Section 2, we prove some lemmas on the
minimal smallest singular value of slim matrices, and particularly, we show the
optimal decay rate for the base case; in Section 3, we prove a duality lemma for
the minimal smallest singular value of matrices of size $n + k$ by $n$; in Section 4, we
undertake an extensive study of the minimal smallest singular value of matrices
of size $n + 3$ by $n$, and we obtain some results by using a technique from integral
geometry and from the perspective of combinatorial geometry; and in Section 5, we
present some numerical experimental results.


2.  Some  Lemmas  on  the  Minimal  Smallest  Singular  Value 
First,  we  have  the  following  lemma. 

Lemma 2.1. For any real matrix $A$ of size $N$ by $n$ with $N > n$, one has

(2.1)  $\sigma_n(A) \ge \min_{S \subset \{1,\dots,n+1\},\, |S| = n} \sigma_n(A_S)$

and

(2.2)  $\sigma_1(A) \ge \min_{S \subset \{1,\dots,n+1\},\, |S| = n} \sigma_1(A_S)$.

Proof. For any $S \subset \{1,\dots,n+1\}$ with $|S| = n$,

(2.3)  $\sigma_n(A_S) = \inf_{v \in \mathbb{R}^n,\, \|v\| = 1} \|A_S v\|$;

and on the other hand,

(2.4)  $\sigma_n(A) = \inf_{V \subset \mathbb{R}^n,\, \dim(V) = 1} \|A|_V\| = \inf_{v \in \mathbb{R}^n,\, \|v\| = 1} \|Av\|$.

Since $Av$ is basically a vector extension of $A_S v$ for every $v \in \mathbb{R}^n$, $\|v\| = 1$, then

(2.5)  $\|A_S v\| \le \|Av\|$

for every $v \in \mathbb{R}^n$, $\|v\| = 1$. Thus, it follows from (2.3) and (2.4) that

(2.6)  $\sigma_n(A_S) \le \sigma_n(A)$

for any $S \subset \{1,\dots,n+1\}$ with $|S| = n$. Hence, we obtain (2.1), and similarly, we
also obtain (2.2). □
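The inequality (2.6) can be checked numerically on a small random matrix. The following Python sketch is illustrative only: the helper names and the sizes are hypothetical, and it enumerates every choice of $n$ rows out of $N$, which is an assumption made for the illustration (the lemma above writes the index set as $\{1,\dots,n+1\}$).

```python
import itertools
import numpy as np

def smallest_singular_value(M):
    """Smallest singular value of a matrix with at least as many rows as columns."""
    return np.linalg.svd(M, compute_uv=False)[-1]

def check_row_submatrix_bound(N=6, n=3, seed=0):
    """Return sigma_n(A) and the min / max of sigma_n(A_S) over all n-row submatrices A_S."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((N, n))
    sigma_full = smallest_singular_value(A)
    sigma_sub = [smallest_singular_value(A[list(S), :])
                 for S in itertools.combinations(range(N), n)]
    return sigma_full, min(sigma_sub), max(sigma_sub)

full, sub_min, sub_max = check_row_submatrix_bound()
# Dropping rows cannot increase the smallest singular value, as in (2.6).
print(full, sub_min, sub_max)
```

Since every $\sigma_n(A_S)$ is bounded by $\sigma_n(A)$, the largest value over the submatrices is also bounded by $\sigma_n(A)$, which is the comparison printed above.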

From the growth rate of the smallest singular value of random matrices established
in [3], one can obtain that

(2.7)  $\sigma_n(A) \to (\sqrt{2} - 1)\sqrt{n}$

for $N = 2n$. On the other hand,


(2.8) 


<7„  (As)  <  O 


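Lemma 2.2. For any $n + 1$ by $n$ matrix $A = \begin{pmatrix} a_1 \\ \vdots \\ a_{n+1} \end{pmatrix}$ with $\|a_i\| \le 1$, $i = 1, \dots, n + 1$, one has

(2.9)  $\min_{S \subset \{1,\dots,n+1\},\, |S| = n} \sigma_n(A_S) \le \frac{1}{\sqrt{n}}$.

Proof. Since $a_1, \dots, a_{n+1}$ are linearly dependent, there are $c_1, \dots, c_{n+1}$ such that

(2.10)  $\sum_{i=1}^{n+1} c_i a_i = 0$

with

(2.11)  $\sum_{i=1}^{n+1} c_i^2 = 1$.

Without loss of generality, assume $|c_{n+1}| = \min(|c_1|, \dots, |c_{n+1}|)$. If $c_{n+1} = 0$, (2.9) is
trivial, because there is an $S$ such that $A_S$ is singular. It suffices to consider the
case of $c_{n+1} \ne 0$. Therefore,

(2.12)  $c_{n+1} a_{n+1} = -\sum_{i=1}^{n} c_i a_i$.

By (2.11),

(2.13)  $(n + 1)\, c_{n+1}^2 \le \sum_{i=1}^{n+1} c_i^2 = 1$.

It follows that

(2.14)  $|c_{n+1}| \le \frac{1}{\sqrt{n + 1}}$

and furthermore

(2.15)  $\frac{\|c_{n+1} a_{n+1}\|}{\sqrt{1 - c_{n+1}^2}} \le \frac{1/\sqrt{n+1}}{\sqrt{n}/\sqrt{n+1}} = \frac{1}{\sqrt{n}}$.

Since

(2.16)  $\min_{S \subset \{1,\dots,n+1\},\, |S| = n} \sigma_n(A_S) \le \frac{\left\|\sum_{i=1}^{n} c_i a_i\right\|}{\sqrt{\sum_{i=1}^{n} c_i^2}} = \frac{\|c_{n+1} a_{n+1}\|}{\sqrt{1 - c_{n+1}^2}}$,

thus (2.9) follows from (2.15). □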


Remark 2.3. However, one can have

(2.17)  $\min_{S \subset \{1,\dots,n+1\},\, |S| = n} \sigma_n(A_S) \ge \frac{1}{n}$

for some matrix $A$. For example, for

(2.18)  $A^T = \begin{pmatrix} 0.9969 & 0.6688 & 0.1610 \\ -0.0782 & 0.7434 & -0.9870 \end{pmatrix}$

we have

(2.19)  $\min_{S \subset \{1,\dots,n+1\},\, |S| = n} \sigma_n(A_S) = 0.6115 > \frac{1}{2}$.
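The value in (2.19) can be reproduced directly. The short Python sketch below is an illustration, not part of the original argument; the helper name `min_sigma_over_row_subsets` is hypothetical. It takes the three rows of $A$ from (2.18), forms every $2 \times 2$ row submatrix, and reports the minimum of their smallest singular values.

```python
import itertools
import numpy as np

# Rows of A, i.e. the columns of A^T in (2.18); here n = 2.
A = np.array([[0.9969, -0.0782],
              [0.6688,  0.7434],
              [0.1610, -0.9870]])

def min_sigma_over_row_subsets(A):
    """Minimum over all n-row submatrices A_S of the smallest singular value of A_S."""
    m, n = A.shape
    return min(np.linalg.svd(A[list(S), :], compute_uv=False)[-1]
               for S in itertools.combinations(range(m), n))

value = min_sigma_over_row_subsets(A)
print(value)   # to be compared with 1/n = 0.5 and with 1/sqrt(n)
```

For unit rows $a_i, a_j$ the smallest singular value of the $2 \times 2$ submatrix is $\sqrt{1 - |\langle a_i, a_j \rangle|}$, so the minimum is attained by the pair of rows with the largest inner product in absolute value.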

For matrices of size $n + 2$ by $n$, one can have the following.

Lemma 2.4. For any $n + 2$ by $n$ matrix $A = \begin{pmatrix} a_1 \\ \vdots \\ a_{n+2} \end{pmatrix}$ with $\|a_i\| \le 1$, $i = 1, \dots, n + 2$, one has

(2.20)  $\min_{S \subset \{1,\dots,n+2\},\, |S| = n} \sigma_n(A_S) \le \frac{C}{n^{3/2}}$

for some constant $C > 0$.

Proof. It suffices to consider matrices of size $n + 2$ by $n$ with rank no less than $n$.
Then for any $z \in \ker(A)$ with $\|z\| = 1$, we have

°n  (As)  <  infzeker(A)  l|j4szsl1 


(2.21) 


=  inf, 

<  inf, 

<  inf 


zGker(A) 

zGker(A) 


zGker(A) 


<  inf 


zGker(A) 


IM 

\\zi1ail+zi2ai2\\ 

llzs 

|Z»1  |  I|a«l  I 

'+KIII 

II 

zs  II 

\zii\+\zi2 

1 

llzsll 

+ 

l-(z2  +z2  ) 
V  M  *2/ 


where $S = \{1, \dots, n + 2\} \setminus \{i_1, i_2\}$ for all $1 \le i_1, i_2 \le n + 2$.

Let $b_1$ and $b_2$ be an orthonormal basis of $\ker(A)$, $b_1 = (b_{1,1}, \dots, b_{1,n+2})$ and
$b_2 = (b_{2,1}, \dots, b_{2,n+2})$, and denote $\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} := B$. Since $z \in \ker(A)$ with $\|z\| = 1$,
there exist $t_1$ and $t_2$ such that

(2.22)  $z = t_1 b_1 + t_2 b_2$

with $t_1^2 + t_2^2 = 1$. Therefore,

(2.23)  $\sqrt{z_{i_1}^2 + z_{i_2}^2} = \sqrt{(t_1 b_{1,i_1} + t_2 b_{2,i_1})^2 + (t_1 b_{1,i_2} + t_2 b_{2,i_2})^2} = \|(t_1, t_2)\, B_{S^c}\|$.

Combining this with (2.21), we have

(2.24)  $\sigma_n(A_S) \le C \inf_{t_1^2 + t_2^2 = 1} \|(t_1, t_2)\, B_{S^c}\| = C \sigma_2(B_{S^c})$

for some constant $C > 0$ and furthermore,

(2.25)  $\min_{S \subset \{1,\dots,n+2\},\, |S| = n} \sigma_n(A_S) \le C \min_{S \subset \{1,\dots,n+2\},\, |S| = n} \sigma_2(B_{S^c})$.

Now let $B = (\beta_1, \dots, \beta_{n+2})$ and normalize the columns of $B$, then

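(2.26)  $\sigma_2(B_{S^c}) \le \max(\|\beta_{i_1}\|, \|\beta_{i_2}\|)\; \sigma_2\!\left(\frac{\beta_{i_1}}{\|\beta_{i_1}\|}, \frac{\beta_{i_2}}{\|\beta_{i_2}\|}\right)$.

Now we can choose the indices $i_1$ and $i_2$, $1 \le i_1, i_2 \le n + 2$, such that

(2.27)  $b_{1,i_1}^2 + b_{2,i_1}^2 + b_{1,i_2}^2 + b_{2,i_2}^2 = \|\beta_{i_1}\|^2 + \|\beta_{i_2}\|^2$

is the smallest among all pairs of indices between 1 and $n + 2$, but since

(2.28)  $\sum_{i=1}^{n+2} b_{1,i}^2 + \sum_{i=1}^{n+2} b_{2,i}^2 = 2$,

we have

(2.29)  $b_{1,i_1}^2 + b_{2,i_1}^2 + b_{1,i_2}^2 + b_{2,i_2}^2 \le \frac{4}{n + 2}$,

which implies

(2.30)  $\max\left(\sqrt{b_{1,i_1}^2 + b_{2,i_1}^2},\; \sqrt{b_{1,i_2}^2 + b_{2,i_2}^2}\right) \le \frac{2}{\sqrt{n + 2}}$.

Therefore, by (2.26), we have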


(2.31) 


minsc{l,...,n+2},|S|=n  +2  (-E+)  < 


/  n+2 


02 


•\/n  +  2 

ftl  ft; 


< 

—  Vn+2 


iiftiir  ii^2 

fti  ft2 

P+l  11*2 


Considering  the  geometry  of  n  +  2  vectors  on  the  unit  circle  and  choose  the  closest 
two  vectors  among  the  n  +  2  unit  vectors,  we  know 


<  2  sin  ■ 


+ 

+ 

11+ II 

11+ II 

(2.32) 

Next,  we  will  show  the  following  inequality 

(2.33) 

Suppose  that 

(2.34) 


n  +  2 


,  ,  2y/2  it 

mm  <j 2  (Bsc)  <  ,  sin - 

SC{l,...,n+2},|S|=n  y/n  +  2  n  +  2 


,  ,  2y[2  7 r 

0-2  (Bsc)  >  ,  sin 


for  all  S  C  {1, 
(2.35) 
we  have 


(2.36) 


yjn  +  2  n  +  2 
n  +  2}  with  | S'!  =  n.  For  any 
£+  =  (++, 


T2  (BSc)  < 


.ft 
lift 

for  1  <  *  <  j  <  n  +  2.  We  can  actually  arrange  the  indices  in  /%,  1  <  i  <  n  +  2, 
so  that  |pi|y)  1  <  i  <  n  +  2,  are  in  the  counterclockwise  order  in  the  unit  disk.  By 
(2.28),  we  know  that 

n+2 

(2-37)  E  H&H2  =  2> 
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and  since  any  chord  is  shorter  than  its  corresponding  arc  on  a  circle,  we  have  that 

(2-38>  gllra-Kall^ 

assuming  (3n+ 3  =  /3i .  From  (2.36), 

(2.39) 

SC{1,...“?2},|S|=n£72  {Bsc)  ~  nml'<«J<n+i  (mindl^H  ’  Ill'll)  I  m\  -  ral) 


<  mmi<j<„+2  f  ||/3*H  |||*||  ||!‘+i|| 


From  (2.37)  and  (2.38),  one  can  obtain  that 
(2.40) 


mmi<i<n+2  ||  Pi 


_ A 

wt  w 


t+ill  ) 


<  ^ESHlIftllll^-^ 

^  _2\/27t 


ft  _  Pi+i 


and  then  (2.20)  follows. 


3.  Duality  Lemma  for  matrices  of  size  n  +  k  by  n 

First  we  have  the  following  duality  lemma  in  general. 

Lemma 3.1. For any matrix $A$ of size $m$ by $n$ with $m > n$ and with all rows normalized
to 1, one has

(3.1)  $\min_{S \subset \{1,\dots,m\},\, |S| = n} \sigma_n(A_S) \le C \min_{T \subset \{1,\dots,m\},\, |T| = m - n} \sigma_n(B_T)$

for some constant $C > 0$, where $B$ consists of any orthogonal basis of $\ker(A)$.

Proof. For any $z \in \ker(A)$ with $\|z\| = 1$, we have

<*n  Os)  <  inf,€ker(^> 

=  inf .et.(4)  l|,-‘i|.+»?"jl1 

(3-2)  <  inf.a„,fl  1 1 H*’1 1  ^'7” " 

^  ;,.f  l*s*h 

—  mtzeker(A)  ||zgjj 

^  y(jl bgglb 


^1*^1  1  z2  %2  II 

llzs  I 

^l_[Jhl]  +|g»2  |  Ilai2  | 

I  zsll 

ysylh 


Let  bx  and  b2  be  an  orthonormal  basis  of  ker  (A),  bi  =  (bn, . . . ,  6iira+2)  and 
=  (621, . .  • ,  i>2,n+2),  and  denote  (  J)*1  ^  :=  B.  Since  z  €  ker  (^4)  with  ||z||  =  1, 


b2  =  (&2ij  •  •  • ,  ^2,71+2),  and  denote  ^  ^  J  :=  B.  Since  z  €  ker 
there  exist  <1  and  1 2  such  that 

(3.3)  z  =  ti?gc 

with  t\  +  =  1-  Therefore, 

(3.4)  ||*HI2  =  lltfiscll 
Combining  (3.2),  we  have 

(3.5)  <jn  (As)  <  C  inf  ||tBgc||  =  Cam-n  (BSc) 
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for  some  constant  C  >  0,  where  Sm  ”  is  the  unit  sphere  in  Rm  n+1,  and  further¬ 
more, 


(3.6) 


min  crn  (As)  <  C  min  <rn  ( Bp )  ■ 


□ 


Remark 3.2. In matrix theory and operator theory, the image of an operator is
regarded as dual to its kernel or null space. Here this duality is similar in essence
to the relationship between the restricted isometry property, Johnson-Lindenstrauss
embedding, and the null space property in signal processing, including compressed
sensing, phase retrieval, and others (see for instance, [27], [17], and [26]).


4.  Decay  rate  for  matrices  of  size  n  +  3  by  n 


Let $P_1, \dots, P_n$ be in the unit disk on the plane, and $d(i, j, k)$ be the distance of
the point $P_i$ to the line connecting other two points $P_j$ and $P_k$, $1 \le i, j, k \le n$. In
this section, we want to study the decay of $\min_{1 \le i,j,k \le n} d(i, j, k)$, as $n \to \infty$.

First, let us prove the following lemma on the decay order of at least $O\left(\frac{1}{n}\right)$.

Lemma 4.1. Let $P_1, \dots, P_n$ be a set of points in the unit disk on the plane. Suppose
that $P_1, \dots, P_n$ are on the boundary of the convex hull of the point set $\{P_1, \dots, P_n\}$
and $d(i, j, k)$ is the distance of the point $P_i$ to the line connecting other two points
$P_j$ and $P_k$, $1 \le i, j, k \le n$, then

(4.1)  $\min_{1 \le i,j,k \le n} d(i, j, k) \le \frac{C}{n}$

for some absolute constant $C$, $C > 0$, independent of $n$.

Proof. Let us cover the unit disk by parallel strips of width $\frac{8}{n}$, then the unit disk
can be covered by $\lceil \frac{n}{4} \rceil$ such strips. By the pigeonhole principle, there exist at least
3 points $P_{i_0}$, $P_{j_0}$ and $P_{k_0}$ which are located in the same strip, thus we have

(4.2)  $\min_{1 \le i,j,k \le n} d(i, j, k) \le d(i_0, j_0, k_0) \le \frac{8}{n}$.

□
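The quantity $\min d(i, j, k)$ can be evaluated by brute force for small point sets. The following Python sketch is illustrative only: the helper names are hypothetical, the points are drawn uniformly at random in the unit disk, and the comparison value $8/n$ is the strip-width bound as read from the proof above.

```python
import itertools
import numpy as np

def point_line_distance(p, a, b):
    """Distance from the point p to the line through a and b (all 2-D arrays)."""
    ab = b - a
    ap = p - a
    cross = ab[0] * ap[1] - ab[1] * ap[0]
    return abs(cross) / np.linalg.norm(ab)

def min_point_line_distance(points):
    """Minimum of d(i, j, k) over all triples of distinct points."""
    best = np.inf
    for i, j, k in itertools.combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        # Each point of the triple is measured against the line through the other two.
        best = min(best,
                   point_line_distance(a, b, c),
                   point_line_distance(b, a, c),
                   point_line_distance(c, a, b))
    return best

def random_points_in_unit_disk(n, rng):
    """n points drawn uniformly from the unit disk."""
    r = np.sqrt(rng.uniform(0.0, 1.0, n))
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))

rng = np.random.default_rng(0)
for n in (5, 10, 20, 40):
    pts = random_points_in_unit_disk(n, rng)
    print(n, min_point_line_distance(pts), 8.0 / n)
```

The enumeration costs $O(n^3)$ distance evaluations, so it is only meant for small $n$.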


Next,  we  prove  the  following  lemma. 


Lemma 4.2. Let $P_1, \dots, P_n$ be a set of points in the unit disk on the plane. Suppose
that $P_1, \dots, P_n$ are on the boundary of the convex hull of the point set $\{P_1, \dots, P_n\}$
and $d(i, j, k)$ is the distance of the point $P_i$ to the line connecting other two points
$P_j$ and $P_k$, $1 \le i, j, k \le n$, then

(4.3)  $\min_{1 \le i,j,k \le n} d(i, j, k) \le \frac{C}{n^2}$

for some absolute constant $C$, $C > 0$, independent of $n$.


Proof. Without loss of generality, we assume that the points $P_1, P_2, \dots$, and $P_n$
are in the counterclockwise order in the unit disk. Firstly, if $P_1, \dots, P_n$ are the
vertices of a convex polygon $P$, then by the Crofton formula in integral geometry
or geometric probability (see for instance [15], [22], and [30]),

(4.4)  $\mathrm{perimeter}(P) = \frac{1}{2} \int_0^{2\pi} \int_0^1 n_P(\theta, r)\, dr\, d\theta$,



where $n_P(\theta, r)$ is the intersection number of the polygon and the oriented line
which has a distance $r$ to the origin and has an angle $\theta$ to the positive horizontal
axis. Let $C$ be the unit circle, again by the Crofton formula, we know

(4.5)  $\mathrm{perimeter}(C) = \frac{1}{2} \int_0^{2\pi} \int_0^1 n_C(\theta, r)\, dr\, d\theta$.

But since the polygon $P$ is convex, then

(4.6)  $n_P(\theta, r) \le 2 = n_C(\theta, r)$,

and it follows from (4.4) and (4.5) that

(4.7)  $\mathrm{perimeter}(P) \le \mathrm{perimeter}(C) = \frac{1}{2} \int_0^{2\pi} \int_0^1 2\, dr\, d\theta = 2\pi$.

Thus the sum of the boundary edges of the polygon satisfies

(4.8)  $\sum_{i=1}^{n} \|P_i P_{i+1}\| \le 2\pi$.

Now let us connect the vertices by edges $P_1P_3$, $P_2P_4$, $\dots$, $P_{n-1}P_1$, and $P_nP_2$, then
we have

(4.9)  $\sum_{i=1}^{n} \left(\angle P_{i+2}P_iP_{i+1} + \angle P_{i+1}P_{i+2}P_i\right) = n\pi - (n - 2)\pi = 2\pi$

assuming $P_{n+1} = P_1$ and $P_{n+2} = P_2$, because there are $n$ triangles and the sum of
the interior angles of the polygon is $(n - 2)\pi$. Furthermore, since

(4.10)  $\sum_{i=1}^{n} \left(\sin(\angle P_{i+2}P_iP_{i+1}) + \sin(\angle P_{i+1}P_{i+2}P_i)\right) \le \sum_{i=1}^{n} \left(\angle P_{i+2}P_iP_{i+1} + \angle P_{i+1}P_{i+2}P_i\right)$,

therefore, we have

(4.11)  $\sum_{i=1}^{n} \left(\sin(\angle P_{i+2}P_iP_{i+1}) + \sin(\angle P_{i+1}P_{i+2}P_i)\right) \le 2\pi$.

By  Cauchy-Schwarz  inequality, 

(4.12) 

E?=1  (sin 5  (ZP,+2PiP,:+1)  +  sin 5  (ZP,+1PJ+2P,:)) 

<  (E?=!  2  ll^+TII)  (Er=i  (sin  (ZPJ+2P,:PJ+i)  +  sin  (ZPl+1Pl+2Pl))) . 

It follows from (4.8) and (4.11) that

(4.13)  $\sum_{i=1}^{n} \|P_iP_{i+1}\|^{1/2} \left(\sin^{1/2}(\angle P_{i+2}P_iP_{i+1}) + \sin^{1/2}(\angle P_{i+1}P_{i+2}P_i)\right) \le 4\pi \cdot 2\pi = 8\pi^2$.

Since there are actually $2n$ terms in the above sum, then we have

(4.14)  $\min_{1 \le i \le n} \|P_iP_{i+1}\|^{1/2} \sin^{1/2}(\angle P_{i+2}P_iP_{i+1}) \le \frac{8\pi^2}{2n} = \frac{4\pi^2}{n}$

or

(4.15)  $\min_{1 \le i \le n} \|P_iP_{i+1}\|^{1/2} \sin^{1/2}(\angle P_{i+1}P_{i+2}P_i) \le \frac{8\pi^2}{2n} = \frac{4\pi^2}{n}$.
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Notice  that 

(4.16)  $d(i + 1, i, i + 2) = \|P_iP_{i+1}\| \sin(\angle P_{i+2}P_iP_{i+1}) = \|P_{i+1}P_{i+2}\| \sin(\angle P_{i+1}P_{i+2}P_i)$,

thus by (4.14) and (4.15) we know that

(4.17)  $\min_{1 \le i \le n} d(i + 1, i, i + 2) \le \frac{16\pi^4}{n^2}$

and the claim in Lemma 4.2 follows, in the case that $P_1, \dots, P_n$ are the vertices of
a convex polygon.

Secondly, if a point is on the boundary edges of the convex hull of the point set but
is not a vertex of the convex polygon, then the distance of the point to the edge
on which the point lies is zero. Thus the claim in Lemma 4.2 automatically holds in
this case. □
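The Crofton formula used in (4.4) can be illustrated by a small Monte Carlo computation. The Python sketch below is an illustration under assumptions that are not in the source: each line is parametrized as $\{x : \langle x, u(\theta) \rangle = r\}$ with $u(\theta) = (\cos\theta, \sin\theta)$ the unit normal, the polygon lies inside the unit disk, and the function names and sample size are hypothetical.

```python
import numpy as np

def polygon_perimeter(vertices):
    """Exact perimeter of a closed polygon given as an (n, 2) array of vertices."""
    edges = np.roll(vertices, -1, axis=0) - vertices
    return np.linalg.norm(edges, axis=1).sum()

def crofton_perimeter_estimate(vertices, samples=200_000, seed=0):
    """Monte Carlo estimate of (1/2) * int_0^{2 pi} int_0^1 n_P(theta, r) dr dtheta."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, samples)
    r = rng.uniform(0.0, 1.0, samples)
    normals = np.column_stack((np.cos(theta), np.sin(theta)))
    # Signed offset of every vertex from every sampled line, shape (n, samples).
    offsets = vertices @ normals.T - r
    # An edge meets the line when its two endpoints lie on opposite sides of it.
    crossings = (offsets * np.roll(offsets, -1, axis=0) < 0).sum(axis=0)
    # Mean of n_P times the area 2 * pi of the parameter domain, times the factor 1/2.
    return 0.5 * 2.0 * np.pi * crossings.mean()

# Square inscribed in the unit circle, vertices in counterclockwise order.
angles = np.pi / 4 + np.arange(4) * np.pi / 2
square = np.column_stack((np.cos(angles), np.sin(angles)))
print(polygon_perimeter(square), crofton_perimeter_estimate(square), 2.0 * np.pi)
```

For a convex polygon every sampled line meets the boundary in either zero or two points, which is the observation behind (4.6) and the perimeter bound $2\pi$ in (4.7).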

Remark 4.3. In the proof of the above lemma, we have used a technique from
integral geometry. For the general theory, one can refer to, for instance, [31],
[10], [21], and [1].

From  this  lemma,  we  can  derive  the  following  corollary  immediately. 

Lemma 4.4. Let $P_1, \dots, P_n$ be a set of points in the unit disk on the plane.
Suppose that $P_{i_1}, \dots, P_{i_{n-s}}$, $0 \le s \le n - 4$, are on the boundary of the convex hull
of the point set $\{P_{i_1}, \dots, P_{i_{n-s}}\}$ and $d(i, j, k)$ is the distance of the point $P_i$ to the
line connecting other two points $P_j$ and $P_k$, $1 \le i, j, k \le n$, then

(4.18)  $\min_{1 \le i,j,k \le n} d(i, j, k) \le \frac{C}{(n - s)^2}$

for some absolute constant $C$, $C > 0$, independent of $n$. In particular, if $s \le \frac{n}{2}$,
we have

(4.19)  $\min_{1 \le i,j,k \le n} d(i, j, k) \le \frac{4C}{n^2}$.

More generally, if $s \le [\lambda n]$ for some absolute constant $\lambda$, $\lambda > 0$, independent of $n$,
then

(4.20)  $\min_{1 \le i,j,k \le n} d(i, j, k) \le \frac{C}{n^2}$

for some absolute constant $C$, $C > 0$, independent of $n$.

Remark  4.5.  Note  that  s  <  n  —  4,  because  by  the  Sylvester-Gallai  theorem  (see  for 
instance  [6]  and  [14]),  if  all  the  points  are  not  collinear,  there  is  a  line  which  passes 
through  exactly  two  of  the  points,  but  (4.3)  will  trivially  hold  if  there  exist  three 
points  in  the  point  set  that  are  colinear  and  here  we  only  need  to  consider  the  sets 
of  n  points  which  have  exactly  ordinary  lines,  on  which  one  can  refer  to  [11], 

and  also  by  the  Erdos-Szekeres  theorem  (see  for  instance  [8]  and  [25]),  any  set  of 
n  generic  points,  n  >  4,  in  the  plane  has  at  least  4  points  that  are  the  vertices  of  a 
convex  quadrilateral. 

In [7] and [13], a set of $2^{n-2}$ points that contains no convex $n$-gon was constructed.
We will analyze the minimal distance $\min_{1 \le i,j,k \le n} d(i, j, k)$ for this extremal
case. Let

(4-21)  Sk.i  :=  |  (x,  yk,i  (x))  :  1  <  x  <  ^  ^  ^  j 


|  Pj+i Pi+2 1 1  sin  (ZPi+1Pi+2Pj) , 
167T4 
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and  define  yk,i  (x)  inductively  as  follows: 


(1)  yk>1  (1)  =  yhi  (1)  =  1; 

(2)  if  k  >  1,  l  >  1,  then 

(4.22)  yky  ( x )  =  yk,i- i  (a:) 

for  1  <  x  <  (kk'ffi3)  and 

k  +  l-  3 
k  -  1 


(4.23)  ykj  (x)  =  yk-i,i  - 

for  (^f)<x<(^f),  where 


(4.24)  ak>l  = 


k  +  l-  2 
k-  1 


max 


k  +  l  —  3 
k-  1 


Gfc.i 


>  2/fc-u 


k  +  l  —  3 
k- 2 


From  the  inductive  definition  of  (x)  ,  we  know  that  yk,i  linearly  depends  on 
yk,i- 1  and  yk-ij.  By  [13],  yk,i  (x)  is  monotone  increasing  with  respect  to  x  for 
1  <  x  <  (  t^3)-  But  ykji  (x)  increases  dramatically  when  x  becomes  large. 

Now  let  us  consider  Sn,n,  the  cardinality  of  Sn_n 


(4-25)  ^=(^n-l)' 

To  preserve  the  convexity  and  concavity  of  subsets  mSn^n  and  confine  it  into  the 
unit  square,  we  use  a  similarity  transformation 

/  ((n-l)!)2  q  \ 

(4.26)  T=[  (2”-2)!  1 

V  »»,  n((-t))  j 

and then $T(S_{n,n}) \subset [0, 1]^2$. Since $S_{n,n}$ is one of the components of the set of
$N = 2^{2n-2}$ points $R_n$ that contains no convex $n$-gon, $T(S_{n,n})$ is one of the
components of the set of $N = 2^{2n-2}$ points in $[0, 1]^2$ that contains no convex $n$-gon.
From Figure 4.1, we can see that the minimal distance $\min_{1 \le i,j,k \le n} d(i, j, k)$ in
$S_{n,n}$ multiplied by $N^2 = 2^{4n-4}$ is very likely bounded, which implies that the minimal
distance $\min_{1 \le i,j,k \le n} d(i, j, k)$ in the set of $N = 2^{2n-2}$ points $R_n$ should decay at
the rate of at least $O\left(\frac{1}{N^2}\right)$.

Considering  the  configurations  of  n  points  in  the  unit  disk,  we  have  the  following 
lemma  first. 


Lemma 4.6. Let $D$ be the unit disk, then

(4.27)  $\min_{1 \le i,j,k \le n} d(v_i, v_j, v_k) \le 2 \sin^2 \frac{\pi}{n}$

for all $v_1, v_2, \dots, v_n \in D$ for $n = 3$ and 4. Therefore

(4.28)  $\min_{1 \le i,j,k \le n} d(v_i, v_j, v_k) \le \frac{2\pi^2}{n^2}$

for all $v_1, v_2, \dots, v_n \in D$ for $n = 3$ and 4.

Proof. For $n = 3$, there are three points $v_1$, $v_2$ and $v_3$ in $D$. Without loss of
generality, we can assume that the side $v_1v_2$ is the longest side and $v_1$ and $v_2$ lie on
the boundary of $D$, denoted as $\partial D$, because one can use translations and rotations.

Figure 4.1. Plotted above are the smallest distances in $S_{n,n}$ multiplied by $2^{4n-4}$.

Let $v_3'$ be the intersection of the line through $v_3$ parallel to the side $v_1v_2$ and the
perpendicular bisector of $v_1v_2$. Then we have

(4.29)  $d(v_3', v_1, v_2) = d(v_3, v_1, v_2)$

and then the minimal heights of the triangles $\triangle v_1v_2v_3$ and $\triangle v_1v_2v_3'$ are equal,
because

(4.30)  $\|v_1v_3'\| = \|v_2v_3'\| \le \max(\|v_2v_3\|, \|v_1v_3\|) \le \|v_1v_2\|$,

in other words, $v_1v_2$ is also the longest side of $\triangle v_1v_2v_3'$, and the areas of the triangles
$\triangle v_1v_2v_3$ and $\triangle v_1v_2v_3'$ are equal.
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Now,  let  us  move  v3  along  the  perpendicular  bisector  of  the  side  v\v 2  towards 
the  direction  in  which  the  height  increases,  until  it  touches  the  boundary  3D  at 
a  point  denoted  by  v".  Let  the  distance  from  a  point  V3  ( t )  on  the  perpendicular 
bisector  of  the  side  uiu2  to  the  side  V1V2  be  t,  then  the  minimal  height  of  the  triangle 
Ai>iV2u3  (<) 

(4.31) 


t, 


0  <  t  < 


^11* 


t  > 


increases  as  t  increases.  Thus  the  minimal  height  of  the  triangle  Av\V2V3  is  greater 
than  or  equal  to  that  of  the  triangle  Av\V2V3. 

Then, we can do a regularization for the triangle whose vertices all lie on $\partial D$. If
one of the vertices does not bisect the arc ending with the other two vertices, and
without loss of generality, we can assume that $v_3''$ does not bisect the arc ending
with $v_1$ and $v_2$, then move $v_3''$ to the midpoint of the arc, and then the new triangle
lying on $\partial D$ has a greater minimal height, by comparing trigonometric functions.
Thus, the equilateral triangle lying on $\partial D$ has the greatest minimal height. This
finishes the proof for the case of $n = 3$.

For $n = 4$, there are two cases to consider, but we will be able to find the
maximum of the minimal heights for both cases. The first case is that one of the
four points is in the interior of the convex hull of the other three points. Let us
assume that $v_4$ is in the interior of the convex hull of the other three points $v_1$,
$v_2$ and $v_3$. Then if we fix $v_1$, $v_2$ and $v_3$, the maximum of the minimal heights for
this case is reached when $v_4$ is at the center of the incircle of the triangle $\triangle v_1v_2v_3$,
because otherwise, the minimal height $\min_{1 \le i,j,k \le 4} d(v_i, v_j, v_k)$ would be less than
the radius of the incircle of the triangle $\triangle v_1v_2v_3$. Using an argument similar to the
case of $n = 3$, we can show that in this case,

(4.32)  $\min_{1 \le i,j,k \le 4} d(v_i, v_j, v_k) \le \frac{1}{2} \le 2 \sin^2 \frac{\pi}{4}$.

The second case is that the four points are all on the boundary of the convex
hull of the point set $\{v_1, v_2, v_3, v_4\}$. One can always find a rectangle $R$ inside the
quadrilateral which has the same minimal height of the triangles of the rectangle
$R$ as the minimal height of the triangles of the quadrilateral. By translations and
dilations, one can obtain another rectangle $R'$ on $\partial D$ of which the minimal height
of the triangles is no less than the minimal height of the triangles of the rectangle $R$.
Through maximizing a simple function, one can get that

(4.33)  $\min_{1 \le i,j,k \le 4} d(v_i, v_j, v_k) \le 2 \sin^2 \frac{\pi}{4}$

in this case. □

In general, if all the points are on the boundary of the convex hull of the point
set $\{v_1, v_2, \dots, v_n\}$, we have


Lemma 4.7. Let $D$ be the unit disk, then

(4.34)  $\min_{1 \le i,j,k \le n} d(v_i, v_j, v_k) \le 2 \sin^2 \frac{\pi}{n}$

for all $v_1, v_2, \dots, v_n \in D$ if all the points are on the boundary of the convex hull
of the point set $\{v_1, v_2, \dots, v_n\}$. Therefore

(4.35)  $\min_{1 \le i,j,k \le n} d(v_i, v_j, v_k) \le \frac{2\pi^2}{n^2}$

for all $v_1, v_2, \dots, v_n \in D$ if all the points are on the boundary of the convex hull
of the point set $\{v_1, v_2, \dots, v_n\}$.

Proof. If all the points are on the boundary of the convex hull of the point set
$\{v_1, v_2, \dots, v_n\}$, we can move the points $\{v_1, v_2, \dots, v_n\}$ towards the boundary and
have a convex $n$-gon whose vertices $\{v_1', v_2', \dots, v_n'\}$ are on $\partial D$ and whose perimeter is
no less than that of the $n$-gon $\{v_1, v_2, \dots, v_n\}$. Indeed, suppose that a vertex $v_{i_0}$ is
not on the boundary $\partial D$, then the level set

(4.36)  $\{u \in D : \|u v_{i_0 - 1}\| + \|u v_{i_0 + 1}\| = \|v_{i_0} v_{i_0 - 1}\| + \|v_{i_0} v_{i_0 + 1}\|\}$,

where $v_{i_0 - 1}$ and $v_{i_0 + 1}$ (assuming $v_{n+1} = v_1$) are the adjacent vertices of $v_{i_0}$, is an
ellipse. Connect the center of the disk $D$ and $v_{i_0}$ by a ray and extend the ray till it
intersects the boundary $\partial D$ at $v_{i_0}'$, then

(4.37)  $\|v_{i_0}' v_{i_0 - 1}\| + \|v_{i_0}' v_{i_0 + 1}\| \ge \|v_{i_0} v_{i_0 - 1}\| + \|v_{i_0} v_{i_0 + 1}\|$.

Thus

(4.38)  $\mathrm{perimeter}(v_1' v_2' \cdots v_n') \ge \mathrm{perimeter}(v_1 v_2 \cdots v_n)$.

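Let $\theta_i$ be the central angle of the chord $v_i' v_{i+1}'$, assuming $v_{n+1}' = v_1'$. Then

(4.39)  $\mathrm{perimeter}(v_1' v_2' \cdots v_n') = 2 \sum_{i=1}^{n} \sin \frac{\theta_i}{2} \le 2n \sin\left(\frac{\sum_{i=1}^{n} \theta_i}{2n}\right) = 2n \sin \frac{\pi}{n}$

by the concavity of the sine function. Combining (4.38) and (4.39), we have

(4.40)  $\mathrm{perimeter}(v_1 v_2 \cdots v_n) \le 2n \sin \frac{\pi}{n}$.

Let us denote the angle between $v_i v_{i+1}$ and $v_i v_{i+2}$ by $\alpha_i$ and the angle between
$v_{i+2} v_{i+1}$ and $v_{i+2} v_i$ by $\beta_i$, assuming $v_{n+1} = v_1$ and $v_{n+2} = v_2$, then

(4.41)  $\sum_{i=1}^{n} \alpha_i + \sum_{i=1}^{n} \beta_i = 2\pi$

and furthermore, we have

(4.42)  $\sum_{i=1}^{n} \sin \alpha_i + \sum_{i=1}^{n} \sin \beta_i \le 2n \sin\left(\frac{\sum_{i=1}^{n} \alpha_i + \sum_{i=1}^{n} \beta_i}{2n}\right) = 2n \sin \frac{\pi}{n}$

again by the concavity of the sine function. Let $s_i := \|v_i v_{i+1}\|$, $x_i := \sin \alpha_i$ and
$y_i := \sin \beta_i$ for $i = 1, \dots, n$, then

(4.43)  $\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i \le 2n \sin \frac{\pi}{n}$

and

(4.44)  $\sum_{i=1}^{n} s_i \le 2n \sin \frac{\pi}{n}$.

Let

(4.45)  $F := \sum_{i=1}^{n} s_i (x_i + y_i) - \lambda \left(\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i - c_1\right) - \mu \left(\sum_{i=1}^{n} s_i - c_2\right)$,

where $0 \le c_1 \le 2n \sin \frac{\pi}{n}$ and $0 \le c_2 \le 2n \sin \frac{\pi}{n}$. Solving the system of equations,

(4.46)  $\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i = c_1$,

and

(4.47)  $\sum_{i=1}^{n} s_i = c_2$,

and

(4.48)  $\partial_{x_i} F = 0$,

that is $\lambda = s_i$, and

(4.49)  $\partial_{s_i} F = 0$,

that is

(4.50)  $\mu = x_i + y_i$,

we get $s_i = \frac{c_2}{n}$ and

(4.51)  $x_i + y_i = \frac{c_1}{n}$

for $i = 1, \dots, n$. By the method of Lagrange multipliers with multiple constraints
(see for instance, [19] and [12]),

(4.52)  $\sum_{i=1}^{n} s_i (x_i + y_i) \le \frac{c_1 c_2}{n} \le 4n \sin^2 \frac{\pi}{n}$,

which implies

(4.53)  $2n \min\left(\min_{1 \le i \le n} s_i x_i,\; \min_{1 \le i \le n} s_i y_i\right) \le 4n \sin^2 \frac{\pi}{n}$.

Thus, there exists an $i_0$, $1 \le i_0 \le n$, such that either

(4.54)  $s_{i_0} x_{i_0} \le 2 \sin^2 \frac{\pi}{n}$

or

(4.55)  $s_{i_0} y_{i_0} \le 2 \sin^2 \frac{\pi}{n}$,

in other words, either

(4.56)  $\|v_{i_0} v_{i_0 + 1}\| \sin \alpha_{i_0} \le 2 \sin^2 \frac{\pi}{n}$

or

(4.57)  $\|v_{i_0} v_{i_0 + 1}\| \sin \beta_{i_0} \le 2 \sin^2 \frac{\pi}{n}$,

which implies (4.27) as desired.

□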
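The bound $2 \sin^2 \frac{\pi}{n}$ can be examined numerically for points on the unit circle, where the regular $n$-gon attains it. The Python sketch below is an illustration only; the helper names are hypothetical. It restricts attention to points on $\partial D$ and to the height of each vertex over the chord joining its two neighbours, which is an upper bound for the minimum over all triples.

```python
import numpy as np

def min_adjacent_height(angles):
    """Smallest distance from a vertex to the line through its two neighbours,
    for points on the unit circle at the given polar angles (at least 3 points)."""
    angles = np.sort(np.asarray(angles, dtype=float) % (2.0 * np.pi))
    p = np.column_stack((np.cos(angles), np.sin(angles)))
    prev_p = np.roll(p, 1, axis=0)       # neighbour in clockwise direction
    next_p = np.roll(p, -1, axis=0)      # neighbour in counterclockwise direction
    chord = next_p - prev_p
    offset = p - prev_p
    cross = chord[:, 0] * offset[:, 1] - chord[:, 1] * offset[:, 0]
    return np.min(np.abs(cross) / np.linalg.norm(chord, axis=1))

def sine_bound(n):
    """The value 2 sin^2(pi / n) appearing in (4.34)."""
    return 2.0 * np.sin(np.pi / n) ** 2

rng = np.random.default_rng(0)
for n in (3, 4, 8, 16):
    regular = 2.0 * np.pi * np.arange(n) / n
    random_angles = rng.uniform(0.0, 2.0 * np.pi, n)
    print(n, min_adjacent_height(regular), min_adjacent_height(random_angles), sine_bound(n))
```

For points on the circle the height of a vertex over the chord of its neighbours equals $2 \sin\frac{a}{2} \sin\frac{b}{2}$, where $a$ and $b$ are the two adjacent central angles, so the regular polygon gives exactly $2 \sin^2 \frac{\pi}{n}$.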
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Now,  let  us  consider  the  complementary  probability  that  any  point  does  not  fall 
into the stripes around the lines connecting the preceding points. To obtain the conditional probability each time a point is dropped into the disk, one needs a lower bound on the covering area. This approach calculates the covering area of the stripes, which overlap, and that area depends on the configuration. For example, when the fourth point is dropped into the disk, the next conditional probability differs according to whether the point falls in the interior or the exterior of the region formed by the three preceding points. More precisely, if there are 4 random points, then there will be 7 overlaps (including one overlapped by three stripes) among the stripes if three points form a triangle whose interior contains the other point, whereas there will be only 4 overlaps among the stripes if the 4 points form a quadrilateral. So the covering area depends on the configuration of the points in the unit square or unit disk.

Furthermore,  one  would  need  to  have  a  significantly  small  probability  estimate 
on  the  minimal  distance  greater  than  -^4  or  more  strongly  in  order  to  show 
that  the  probability  that  the  minimal  distance  is  less  than  (=4  is  significantly  high. 
Thus,  if  one  uses  the  probability  approach,  the  covering  area  of  the  stripes  may  be 
estimated.  But  the  obstruction  caused  by  configurations  or  convexity  is  still  the 
main  hard  part  to  solve  the  problem  completely  by  soft  analysis  or  by  quasi-exact 
hard  analysis. 

Let's look into the subdivisions of the unit square now. Let $S$ be a set of $n$ points in the unit square. Let $q_n$ be the maximum of the minimal distance from any point of $S$ to the line joining any other two points of $S$, in which the maximum is taken over all configurations of $n$ points in the unit square, and $p_n = n q_n$. Suppose $S_0$ is the configuration that achieves the maximum, and divide the unit square into $4^k$ sub-regions of equal area and equal shape, by using the midpoints of the edges, with a suitable arrangement of the boundaries so that every point belongs to only one sub-square. We have the following lemma regarding the behavior of $p_n$.

Lemma 4.8. Suppose that a sub-region contains no more than $\frac{n}{4^{k+l}}$ points of $S_0$ for some $l > 0$. Then there exists an $n_1$, such that $\frac{(4^{k+l}-1)\,n}{(4^k-1)(4^{k+l})} \le n_1 < n$ and

$$p_n \le \frac{(4^k-1)(4^{k+l})}{2^k (4^{k+l}-1)}\, p_{n_1}.$$

Proof. By the pigeonhole principle, there exists a sub-region $Q$ that contains at least $\frac{(4^{k+l}-1)\,n}{(4^k-1)(4^{k+l})}$ points of $S_0$. Let $n_1$ be the number of points of $S_0$ that fall into $Q$. Then

$$q_n \le \min_{v_i, v_j, v_k \in Q} d(v_i, v_j, v_k) \le \frac{1}{2^k}\, q_{n_1} = \frac{1}{2^k n_1}\, p_{n_1} \le \frac{(4^k-1)(4^{k+l})}{2^k (4^{k+l}-1)\, n}\, p_{n_1}. \tag{4.58}$$

Thus it follows that

$$p_n \le \frac{(4^k-1)(4^{k+l})}{2^k (4^{k+l}-1)}\, p_{n_1}. \tag{4.59}$$

□


Let us continue considering the subdivisions of the unit square.

Lemma 4.9. Let $v_1, \ldots, v_n$ be a set of $n$ points in the unit square on the plane, and connect all pairs of points by line segments. Given any $\varepsilon$, $0 < \varepsilon < 1$, there exist more than $\frac{n\varepsilon^2 - 8}{2\varepsilon^2}$ distinct line segments whose length is less than $\varepsilon$.

Proof. Let us divide the unit square into $4^k$ sub-squares of equal area, by using the midpoints of the edges, with a suitable arrangement of the boundaries so that every point belongs to only one sub-square, and connect every pair of points in the same sub-square by a line segment.

For any given $\varepsilon$, $0 < \varepsilon < 1$, there exists a $k$ such that

$$\frac{1}{\varepsilon} < 2^k < \frac{2\sqrt{2}}{\varepsilon}. \tag{4.60}$$

Let $n_i$ be the number of points in the $i$-th sub-square, $i = 1, \ldots, 4^k$, then $n = \sum_{i=1}^{4^k} n_i$ and the total number of the line segments in the sub-squares is

$$\sum_{i=1}^{4^k} \frac{n_i (n_i - 1)}{2} \ge \frac{1}{2} \sum_{i=1}^{4^k} (n_i - 1) = \frac{n - 4^k}{2}, \tag{4.61}$$

since $n_i (n_i - 1) = 0$ if $n_i = 0$ or $1$. Furthermore, by (4.60),

$$\frac{n - 4^k}{2} > \frac{n}{2} - \frac{4}{\varepsilon^2}. \tag{4.62}$$

Thus, the total number of line segments in the sub-squares is greater than $\frac{n}{2} - \frac{4}{\varepsilon^2}$, and the length of each line segment is less than $\varepsilon$, since the length of each side of the sub-squares is $\frac{1}{2^k}$, which is less than $\varepsilon$ by (4.60). □
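The counting step in (4.61) can be checked numerically. The following is a minimal sketch in Python, not part of the original argument: the helper name `same_cell_pairs` and the parameters `n = 200`, `k = 3` are illustrative assumptions. It bins random points into the $4^k$ sub-squares, lists the segments joining points in the same sub-square, and compares their number with $(n - 4^k)/2$. It also reports the longest such segment next to the sub-square diagonal $\sqrt{2}/2^k$, which bounds the length of any segment inside a sub-square.

```python
import itertools
import math
import random
from collections import defaultdict


def same_cell_pairs(points, k):
    """Return all pairs of points that share one of the 4**k sub-squares of [0,1]^2."""
    side = 2 ** k  # number of sub-squares along each axis
    cells = defaultdict(list)
    for x, y in points:
        # Clamp the index so points on the top/right boundary land in the last cell.
        col = min(int(x * side), side - 1)
        row = min(int(y * side), side - 1)
        cells[(col, row)].append((x, y))
    return [pair
            for members in cells.values()
            for pair in itertools.combinations(members, 2)]


rng = random.Random(0)
n, k = 200, 3  # illustrative sizes, not taken from the text
points = [(rng.random(), rng.random()) for _ in range(n)]

pairs = same_cell_pairs(points, k)
lower_bound = (n - 4 ** k) / 2            # right-hand side of (4.61)
longest = max(math.dist(p, q) for p, q in pairs)
diagonal = math.sqrt(2) / 2 ** k          # diagonal of one sub-square

print(len(pairs), lower_bound, longest, diagonal)
```

The bound in (4.61) is deterministic, so the count is at least $(n - 4^k)/2$ for any point set, not only for random ones.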

On the angles, one has the following lemma.

Lemma 4.10. For any $\alpha > 0$, among the angles between the distinct lines, there exist at least $\frac{\alpha (n\varepsilon^2 - 8)}{2\varepsilon^2 (\alpha + \pi)}$ angles less than $\alpha$.

Proof. Take any point in the plane as the vertex of the angle $\pi$ and divide the angle into $\lfloor \pi/\alpha \rfloor + 1$ smaller angles of equal degree. We can do parallel transports on the lines so that they pass through the vertex of the angle $\pi$. Then by the pigeonhole principle, there must be $\frac{\alpha (n\varepsilon^2 - 8)}{2\varepsilon^2 (\alpha + \pi)}$ lines falling into the same angle, which is less than $\alpha$. □

Considering the edge and angle, one has

Lemma 4.11. If the smallest angle and edge are adjacent, then

$$\min_{1 \le i,j,k \le n} d(v_i, v_j, v_k) \le \frac{C}{n \log n} \tag{4.63}$$

for a constant $C > 0$.

Proof. Choose $\varepsilon = \ldots$ and $\alpha = \ldots$, then

$$\ldots > 1 \tag{4.64}$$

for $n > 6$ and

$$\frac{\alpha (n\varepsilon^2 - 8)}{2\varepsilon^2 (\alpha + \pi)} > 1 \tag{4.65}$$

for $n > 15$. Therefore,

$$\min_{1 \le i,j,k \le n} d(v_i, v_j, v_k) \le \frac{1}{\log n} \cdot \sin\frac{\ldots}{n} \le \frac{\ldots}{n \log n} \tag{4.66}$$

for $n > 15$, and then (4.63) follows. □


We used quasi-exact hard analysis to obtain the decay rate. However, the tools or techniques in hard analysis may be used to obtain the same order of decay but probably a better constant in the decay rate. From the perspective of hard analysis, based on numerical experiment results, we formulate the following conjecture for a slower decay rate.

Conjecture 4.12. Let $P_1, \ldots, P_n$ be a set of points in the unit disk on the plane and $d(i,j,k)$ be the distance of the point $P_i$ to the line connecting the other two points $P_j$ and $P_k$, $1 \le i,j,k \le n$, then

$$\min_{1 \le i,j,k \le n} d(i,j,k) \le \frac{C}{n^{1+\varepsilon_0}} \tag{4.67}$$

for some absolute constant $C > 0$, independent of $n$, and some $\varepsilon_0 > 0$.


5.  Numerical  Experiments 

In  this  section,  we  would  like  to  present  some  numerical  experimental  results. 

In the first and second numerical experiments, we use MATLAB to randomly generate $n$ points in the unit square $[0,1]^2$ whose two coordinates are independent and identically distributed copies of uniformly distributed random variables, and then compute the minimal distance of a point to the line connecting two other points. For each matrix size $n$, we repeat this procedure $n^2$ times to obtain $n^2$ sets of $n$ points, and then take the maximum of the minimal distance over the $n^2$ repeats, because the number of configurations grows rapidly as the size of the point set increases. After that, we multiply the maximum of the minimal distance by $n^2$ to compare the decay rate with $\frac{1}{n^2}$. From Figure 5.1a on page 18 and Figure 5.1b on page 18, we can see that $n^2 \min_{1 \le i,j,k \le n} d(i,j,k)$ is bounded as $n$ increases, so $\min_{1 \le i,j,k \le n} d(i,j,k)$ mostly decays at the order of at least $O\left(\frac{1}{n^2}\right)$ if the points are generated by normal random variables.
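The procedure just described can be sketched as follows. The original experiments were run in MATLAB; this is an illustrative Python re-implementation under stated assumptions, not the authors' script. The function names, the seed and the sizes `n = 5, 10, 15` are hypothetical choices. For each $n$ it draws `repeats` uniform point sets, computes the minimal point-to-line distance over all triples, keeps the maximum over the repeats, and prints it scaled by $n^2$ and $n^3$.

```python
import itertools
import math
import random


def point_line_distance(p, a, b):
    """Distance from point p to the infinite line through the distinct points a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    return abs(cross) / math.hypot(bx - ax, by - ay)


def min_point_line_distance(points):
    """Minimum over all triples of the distance from one point to the line through two others."""
    best = math.inf
    for (i, a), (j, b) in itertools.combinations(enumerate(points), 2):
        for k, p in enumerate(points):
            if k != i and k != j:
                best = min(best, point_line_distance(p, a, b))
    return best


def max_of_min_distance(n, repeats, rng):
    """Maximum, over `repeats` uniform random point sets of size n, of the minimal distance."""
    return max(
        min_point_line_distance([(rng.random(), rng.random()) for _ in range(n)])
        for _ in range(repeats)
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen only for reproducibility
    for n in (5, 10, 15):
        m = max_of_min_distance(n, n * n, rng)
        print(n, m, n ** 2 * m, n ** 3 * m)
```

The triple loop costs $O(n^3)$ per point set, so with $n^2$ repeats the total work grows like $n^5$, which is consistent with the experiments stopping at a few dozen points.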

In the third and fourth numerical experiments, we use MATLAB to randomly generate $n$ points in the unit square $[0,1]^2$ whose two coordinates are independent and identically distributed copies of uniformly distributed random variables, and then compute the minimal distance of a point to the line connecting two other points. For each matrix size $n$, we repeat this procedure $n^2$ times to obtain $n^2$ sets of $n$ points, and then take the maximum of the minimal distance over the $n^2$ repeats, because the number of configurations grows rapidly as the size of the point set increases. After that, we multiply the maximum of the minimal distance by $n^3$ to compare the decay rate with $\frac{1}{n^3}$. From Figure 5.2a on page 19 and Figure 5.2b on page 19, we can see that $n^3 \min_{1 \le i,j,k \le n} d(i,j,k)$ is bounded as $n$ increases, so $\min_{1 \le i,j,k \le n} d(i,j,k)$ mostly decays with high probability at the order of $O\left(\frac{1}{n^3}\right)$ if the points are generated by normal random variables.




(a) Size of the point sets up to 30    (b) Size of the point sets up to 40

Figure 5.1. Plotted above are the smallest distances to lines multiplied by the square of the sizes of point sets, in which the two coordinates of points are independent and identically distributed copies of uniformly distributed random variables.


In the fifth numerical experiment, we use MATLAB to randomly generate $n$ points in the unit square $[0,1]^2$ whose two coordinates are independent and identically distributed copies of uniformly distributed random variables, and then compute the minimal distance of a point to the line connecting two other points. For each matrix size $n$, we repeat this procedure 80 times to obtain 80 sets of $n$ points, and then take the maximum of the minimal distance over the 80 repeats, because the number of configurations grows rapidly as the size of the point set increases. After that, we multiply the maximum of the minimal distance by $n^3$ to compare the decay rate with $\frac{1}{n^3}$. From Figure 5.3a on page 20, we can see that $n^3 \min_{1 \le i,j,k \le n} d(i,j,k)$ is bounded as $n$ increases, so $\min_{1 \le i,j,k \le n} d(i,j,k)$ mostly decays with high probability at the order of $O\left(\frac{1}{n^3}\right)$ if the points are generated by normal random variables. In the sixth numerical experiment, we use MATLAB to randomly generate $n$ points in the unit square $[0,1]^2$ whose two coordinates are independent and identically distributed copies of uniformly distributed random variables, and then compute the minimal distance of a point to the line connecting two other points. For each matrix size $n$, we repeat this procedure 100 times to obtain 100 sets of $n$ points, and then take the maximum of the minimal distance over the 100 repeats, because the number of configurations grows rapidly as the size of the point set increases. After that, we multiply the maximum of the minimal distance by $n^3$ to compare the decay rate




(a) Size of the point sets up to 30    (b) Size of the point sets up to 40

Figure 5.2. Plotted above are the smallest distances to lines multiplied by the square of the sizes of point sets, in which the two coordinates of points are independent and identically distributed copies of uniformly distributed random variables.


with $\frac{1}{n^3}$. From Figure 5.3b on page 20, we can see that $n^3 \min_{1 \le i,j,k \le n} d(i,j,k)$ is bounded as $n$ increases, so $\min_{1 \le i,j,k \le n} d(i,j,k)$ mostly decays with high probability at the order of $O\left(\frac{1}{n^3}\right)$ if the points are generated by normal random variables.
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A  DISTRIBUTED  AND  INCREMENTAL  SVD  ALGORITHM  FOR 

AGGLOMERATIVE DATA ANALYSIS ON LARGE NETWORKS

M. A. IWEN* AND B. W. ONG†

Abstract. In this paper it is shown that the SVD of a matrix can be constructed efficiently in a hierarchical approach. The proposed algorithm is proven to recover the singular values and left singular vectors of the input matrix $A$ if its rank is known. Further, the hierarchical algorithm can be used to recover the $d$ largest singular values and left singular vectors with bounded error. It is also shown that the proposed method is stable with respect to roundoff errors or corruption of the original matrix entries. Numerical experiments validate the proposed algorithms and parallel cost analysis.

Key words. Singular value decomposition; low-rank approximations; distributed computing; incremental SVD

AMS subject classifications. 15-A23, 65-F20

1. Introduction. The singular value decomposition (SVD) of a matrix,

$$A = U \Sigma V^*, \tag{1}$$

has applications in many areas including principal component analysis [13], the solution to homogeneous linear equations, and low-rank matrix approximations. If $A$ is a complex matrix of size $D \times N$, then the factor $U$ is a unitary matrix of size $D \times D$ whose first nonzero entry in each column is a positive real number,¹ $\Sigma$ is a rectangular matrix of size $D \times N$ with non-negative real numbers (known as singular values) ordered from largest to smallest down its diagonal, and $V^*$ (the conjugate transpose of $V$) is also a unitary matrix of size $N \times N$. If the matrix $A$ is of rank $d < \min(D, N)$, then a reduced SVD representation is possible:

$$A = \hat{U} \hat{\Sigma} \hat{V}^*, \tag{2}$$

where $\hat{\Sigma}$ is a $d \times d$ diagonal matrix with positive singular values, $\hat{U}$ is a $D \times d$ matrix with orthonormal columns, and $\hat{V}$ is an $N \times d$ matrix with orthonormal columns.
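The difference between the full factorization (1) and the reduced one (2) can be seen in a small example. This is a minimal sketch using NumPy, which is an assumption of this illustration and not a tool named in the paper. The sizes `D = 4`, `N = 50`, `d = 2` and the real-valued test matrix are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, d = 4, 50, 2

# Build a D x N matrix of exact rank d as a product of two thin factors.
A = rng.standard_normal((D, d)) @ rng.standard_normal((d, N))

# Full SVD as in (1): U is D x D, Vh (= V*) is N x N, s holds min(D, N) singular values.
U, s, Vh = np.linalg.svd(A, full_matrices=True)

# Reduced SVD as in (2): keep only the d nonzero singular values and their vectors.
U_hat, s_hat, Vh_hat = U[:, :d], s[:d], Vh[:d, :]
A_rebuilt = U_hat @ np.diag(s_hat) @ Vh_hat

print(U.shape, Vh.shape, U_hat.shape, Vh_hat.shape)
print(np.allclose(A, A_rebuilt))
```

For $N \gg D$ the full factor $V$ has $N^2$ entries while the reduced factors together have only $(D + N + 1)d$, which is the storage argument for working with the reduced form.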

The SVD of $A$ is typically computed in three stages: a bidiagonal reduction step, computation of the singular values, and then computation of the singular vectors. The bidiagonal reduction step is computationally intensive, and is often targeted for parallelization. A serial approach to the bidiagonal reduction is the Golub-Kahan bidiagonalization algorithm [9], which reduces the matrix $A$ to an upper-bidiagonal matrix by applying a series of Householder reflections alternately from the left and right. Low-level parallelism is possible by distributing matrix-vector multiplies, for example by using the cluster computing framework Spark [21]. This form of low-level parallelism for the SVD has been implemented in the Spark project MLlib [18], and in Magma [1], which develops its own framework to leverage GPU accelerators and hybrid manycore systems. Alternatively, parallelization is possible on an algorithmic level by applying independent reflections simultaneously; for example, [15] maps the bidiagonalization algorithm to graphics processing units (GPUs), and [14] executes the bidiagonalization on a distributed cluster. Load balancing is an issue for such parallel algorithms, however, because the number of off-diagonal columns (or rows) to eliminate gets successively smaller. More recently, two-stage approaches have been proposed and utilized in high-performance implementations for the bidiagonal reduction [16, 10]. The first stage reduces the original matrix to a banded matrix; the second stage subsequently reduces the banded matrix to the desired upper-bidiagonal matrix. A heroic effort to optimize the algorithms to hide latency and cache misses was discussed and implemented in [10]. Parallelization is also possible if one uses a probabilistic approach to approximating the SVD [11].

Research @ MSU.

†Department of Mathematical Sciences, Michigan Technological University, Houghton, MI

¹This last condition on $U$ guarantees that the SVD of $A \in \mathbb{C}^{N \times N}$ will be unique whenever $AA^*$ has no repeated eigenvalues.

In this paper, we are concerned with finding the SVD of highly rectangular matrices, $N \gg D$. In many applications where such problems are posed, one typically cares about the singular values and the left singular vectors. For example, this work was motivated by the SVDs required in Geometric Multi-Resolution Analysis (GMRA) [2]; the higher-order singular value decomposition (HOSVD) [7] of a tensor requires the computation of $n$ SVDs of very rectangular matrices, where $n$ is the number of tensor modes. Similarly, tensor train factorization algorithms [19] for tensors require the computation of many very rectangular SVDs. In fact, the SVDs of distributed and highly rectangular matrices of data appear in many big-data era machine learning applications. To find the SVD of highly rectangular matrices, many methods have focused on randomized techniques; [17] provides a recent survey of these techniques.

Alternatively, one can take an incremental approach to computing the SVD of an input matrix. Such methods have the advantage that they can be used to help efficiently analyze datasets which (rapidly) evolve over time. Examples of such methods include [5], which computes the SVD of a matrix by adding one column at a time; more generally, one can add blocks of a matrix at each time. In [4] a block-incremental approach for estimating the dominant singular values and vectors of a highly rectangular matrix is described. It is based on a QR factorization of blocks from the input matrix, which can be done efficiently in parallel. In fact, the QR decomposition can be computed using a communication-avoiding QR (CAQR) factorization [8], which utilizes a tree-reduction approach.

Our approach is similar in spirit to [8], but differs in that we utilize a block decomposition approach based on a partial SVD rather than a full QR factorization. This is advantageous if the application only requires the singular values and/or left singular vectors, as in tensor factorization [7, 19] and GMRA applications [2]. Another approach would be to compute the eigenvalue decomposition of the Gram matrix, $AA^*$ [3]. Although computing the Gram matrix in parallel is straightforward using the block inner product, the downsides to this approach are a loss of numerical precision and the need for the entire matrix $A$ to be available, which one may not have easy access to (i.e., computation of the Gram matrix, $AA^*$, is not easily achieved in an incremental and distributed setting).

The  remainder  of  the  paper  is  laid  out  as  follows:  In  Section  2,  we  motivate 
incremental  approaches  to  constructing  the  SVD  before  introducing  the  hierarchical 
algorithm.  Theoretical  justifications  are  given  to  show  that  the  algorithm  exactly 
recovers  the  singular  values  and  left  singular  vectors  if  the  rank  of  the  matrix  A  is 
known.  An  error  analysis  is  also  used  to  show  that  the  hierarchical  algorithm  can  be 
used  to  recover  the  d  largest  singular  values  and  left  singular  vectors  with  bounded 
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error,  and  that  the  algorithm  is  stable  with  respect  to  roundoff  errors  or  corruption  of 
the  original  matrix  entries.  In  Section  3,  numerical  experiments  validate  the  proposed 
algorithms  and  parallel  cost  analysis. 

2. An Incremental (hierarchical) SVD Approach. The overall idea behind the proposed approach is relatively simple. We require a distributed and incremental approach for computing the singular values and left singular vectors of all data stored across a large distributed network. This can be achieved by, e.g., performing an incremental partial SVD separately on each network node by occasionally combining each node's previously computed partial SVD representation of its past data with a new partial SVD of its more recent data. The result of this approach will be that each separate network node always contains a fairly accurate approximation of its cumulative data over time. Of course, these separate partial SVDs must then be merged together in order to understand the network data as a whole. Toward this end, neighboring nodes' partial SVD approximations can be combined hierarchically in order to compute a global partial SVD of the data stored across the entire network.

Note  that  the  accuracy  of  the  entire  approach  will  be  determined  by  the  accuracy 
of  the  (hierarchical)  partial  SVD  merging  technique,  which  is  ultimately  what  leads 
to  the  proposed  method  being  both  incremental  and  distributed.  Theoretical  analysis 
of  this  partial  SVD  merging  technique  is  the  primary  purpose  of  this  section.  In 
particular,  we  prove  the  proposed  partial  SVD  merging  scheme  is  both  numerically 
robust  to  data  and/or  roundoff  errors,  and  also  accurate  even  when  the  rank  of  the 
overall  data  matrix  A  is  underestimated  and/or  purposefully  reduced. 

2.1. Mathematical Preliminaries. Let $A \in \mathbb{C}^{D \times N}$ be a highly rectangular matrix, with $N \gg D$. Further, let $A^i \in \mathbb{C}^{D \times N_i}$ with $i = 1, 2, \ldots, M$, denote the block decomposition of $A$, i.e., $A = [A^1 | A^2 | \cdots | A^M]$.

Definition 1. For any matrix $A \in \mathbb{C}^{D \times N}$, $(A)_d \in \mathbb{C}^{D \times N}$ is an optimal rank $d$ approximation to $A$ with respect to the Frobenius norm $\|\cdot\|_F$ if

$$\inf_{B \in \mathbb{C}^{D \times N}} \|B - A\|_F = \|(A)_d - A\|_F, \quad \text{subject to } \operatorname{rank}(B) \le d.$$

Further, if $A$ has the SVD decomposition $A = U \Sigma V^*$, then $(A)_d = \sum_{i=1}^{d} u_i \sigma_i v_i^*$, where $u_i$ and $v_i$ are singular vectors that comprise $U$ and $V$ respectively, and $\sigma_i$ are the singular values.

This first lemma proves that partial SVDs of blocks of our original data matrix, $A \in \mathbb{C}^{D \times N}$, can be combined block-wise into a new reduced matrix $B$ which has the same singular values and left singular vectors as the original $A$. This basic lemma can be considered as the simplest merging method for either constructing an incremental SVD approach (different blocks of $A$ have their partial SVDs computed at different times, which are subsequently merged into $B$), a distributed SVD approach (different nodes of a network compute partial SVDs of different blocks of $A$ separately, and then send them to a single master node for combination into $B$), or both.

Lemma 2. Suppose that $A \in \mathbb{C}^{D \times N}$ has rank $d \in \{1, \ldots, D\}$, and let $A^i \in \mathbb{C}^{D \times N_i}$, $i = 1, 2, \ldots, M$ be the block decomposition of $A$, i.e., $A = [A^1 | A^2 | \cdots | A^M]$. Since $A^i$ has rank at most $d$, each block has a reduced SVD representation,

$$A^i = \sum_{j=1}^{d} \hat{u}^i_j \hat{\sigma}^i_j (\hat{v}^i_j)^* = \hat{U}^i \hat{\Sigma}^i (\hat{V}^i)^*, \quad i = 1, 2, \ldots, M.$$

Let $B := [\hat{U}^1 \hat{\Sigma}^1 | \hat{U}^2 \hat{\Sigma}^2 | \cdots | \hat{U}^M \hat{\Sigma}^M]$. If $A$ has the reduced SVD decomposition, $A = \hat{U} \hat{\Sigma} \hat{V}^*$, and $B$ has the reduced SVD decomposition, $B = \hat{U}' \hat{\Sigma}' (\hat{V}')^*$, then $\hat{\Sigma} = \hat{\Sigma}'$, and $\hat{U} = \hat{U}' W$, where $W$ is a unitary block diagonal matrix. If none of the nonzero singular values are repeated then $\hat{U} = \hat{U}'$ (i.e., $W$ is the identity when all the nonzero singular values of $A$ are unique).


Proof. The singular values of $A$ are the (non-negative) square roots of the eigenvalues of $AA^*$. Using the block definition of $A$,

$$AA^* = \sum_{i=1}^{M} A^i (A^i)^* = \sum_{i=1}^{M} \hat{U}^i \hat{\Sigma}^i (\hat{V}^i)^* (\hat{V}^i) (\hat{\Sigma}^i)^* (\hat{U}^i)^* = \sum_{i=1}^{M} \hat{U}^i \hat{\Sigma}^i (\hat{\Sigma}^i)^* (\hat{U}^i)^*.$$

Similarly, the singular values of $B$ are the (non-negative) square roots of the eigenvalues of $BB^*$.

$$BB^* = \sum_{i=1}^{M} (\hat{U}^i \hat{\Sigma}^i)(\hat{U}^i \hat{\Sigma}^i)^* = \sum_{i=1}^{M} \hat{U}^i \hat{\Sigma}^i (\hat{\Sigma}^i)^* (\hat{U}^i)^*.$$

Since $AA^* = BB^*$, the singular values of $B$ must be the same as the singular values of $A$. Similarly, the left singular vectors of both $A$ and $B$ will be eigenvectors of $AA^*$ and $BB^*$, respectively. Since $AA^* = BB^*$ the eigenspaces associated with each (possibly repeated) eigenvalue will also be identical so that $\hat{U} = \hat{U}' W$. The block diagonal unitary matrix $W$ (with one unitary $h \times h$ block for each eigenvalue that is repeated $h$-times) allows for singular vectors associated with repeated singular values to be rotated in the matrix representation $\hat{U}$. □
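Lemma 2 is easy to check numerically. The sketch below uses NumPy (an assumption of this illustration) on a real-valued rank-$d$ test matrix. The helper name `proxy_matrix` and the sizes are hypothetical. It builds the proxy $B$ from the scaled left singular vectors of each block and compares $B$ with $A$ through their singular values, their Gram matrices, and their leading left singular vectors up to sign.

```python
import numpy as np


def proxy_matrix(blocks, d):
    """Concatenate U^i Sigma^i (first d columns) of every block, as in Lemma 2."""
    scaled = []
    for block in blocks:
        U, s, _ = np.linalg.svd(block, full_matrices=False)
        scaled.append(U[:, :d] * s[:d])  # scales column j of U by s[j]
    return np.hstack(scaled)


rng = np.random.default_rng(1)
D, N, d, M = 6, 120, 3, 4

# Real-valued test matrix of exact rank d, split column-wise into M blocks.
A = rng.standard_normal((D, d)) @ rng.standard_normal((d, N))
blocks = np.array_split(A, M, axis=1)
B = proxy_matrix(blocks, d)  # D x (M * d) instead of D x N

U_A, s_A, _ = np.linalg.svd(A, full_matrices=False)
U_B, s_B, _ = np.linalg.svd(B, full_matrices=False)

print(B.shape)
print(np.allclose(s_A[:d], s_B[:d]))
# Left singular vectors agree up to a sign per column when singular values are distinct.
print(np.allclose(np.abs(U_A[:, :d].T @ U_B[:, :d]), np.eye(d), atol=1e-6))
```

The right singular vectors of the blocks are never stored or communicated, and $B$ has only $Md$ columns, which is what makes the merge cheap when $N \gg D$.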

We  now  propose  and  analyze  a  more  useful  SVD  approach  which  takes  the  ideas 
present  in  Lemma  2  to  their  logical  conclusion. 

2.2. An Incremental (Hierarchical) SVD Algorithm. The idea is to leverage the result in Lemma 2 by computing (in parallel) the SVD of the blocks of $A$, concatenating the scaled left singular vectors of the blocks to form a proxy matrix $B$, and then finally recovering the singular values and left singular vectors of the original matrix $A$ by finding the SVD of the proxy matrix. A visualization of these steps is shown in Figure 1. Provided the proxy matrix is not very large, the computational and memory bottleneck of this algorithm is in the simultaneous SVD computation of the blocks $A^i$. If the proxy matrix is sufficiently large that the computational/memory overhead is significant, a multi-level hierarchical generalization is possible through repeated application of Lemma 2. Specifically, one could generate multiple proxy matrices by concatenating subsets of scaled left singular vectors obtained from the SVD of blocks of $A$, find the SVD of the proxy matrices and concatenate those singular vectors to form a new proxy matrix, and then finally recover the singular values and left singular vectors of the original matrix $A$ by finding the SVD of the proxy matrix. A visualization of this generalization is shown in Figure 2 for a two-level parallel decomposition. A general $q$-level algorithm is described in Algorithm 1.
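The multi-level merge described above can be sketched serially as follows. This is an illustrative NumPy version under stated assumptions, not the authors' implementation. The function names `scaled_left_factors` and `hierarchical_svd` are hypothetical, the per-level SVDs that would run in parallel are executed in a plain loop, and the demo uses a real-valued rank-$d$ matrix with $M = 8$ blocks and $n = 2$.

```python
import numpy as np


def scaled_left_factors(matrix, d):
    """Return U_d Sigma_d (at most d columns) from a thin SVD of `matrix`."""
    U, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :d] * s[:d]


def hierarchical_svd(blocks, n, d):
    """Merge block SVDs level by level, concatenating n proxies at each level.

    Returns the leading d left singular vectors and singular values of the
    final proxy matrix. With M = n**q blocks the loop runs q times.
    """
    if n < 2:
        raise ValueError("n must be at least 2 so that each level shrinks")
    level = list(blocks)
    while len(level) > 1:
        # Each of these SVDs is independent and could run on a separate node.
        factors = [scaled_left_factors(b, d) for b in level]
        level = [np.hstack(factors[i:i + n]) for i in range(0, len(factors), n)]
    U, s, _ = np.linalg.svd(level[0], full_matrices=False)
    return U[:, :d], s[:d]


rng = np.random.default_rng(2)
D, N, d = 5, 160, 3
A = rng.standard_normal((D, d)) @ rng.standard_normal((d, N))  # exact rank d
blocks = np.array_split(A, 8, axis=1)                          # M = 8 = 2**3

U_h, s_h = hierarchical_svd(blocks, n=2, d=d)
U_ref, s_ref, _ = np.linalg.svd(A, full_matrices=False)

print(np.allclose(s_h, s_ref[:d]))
print(np.allclose(U_h @ U_h.T, U_ref[:, :d] @ U_ref[:, :d].T))
```

Comparing the projectors $U U^T$ rather than the columns themselves sidesteps the sign ambiguity of singular vectors. When `d` is smaller than the rank of `A`, the same code returns an approximation rather than an exact recovery.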

2.3. Theoretical Justification. In this section we will introduce some additional notation for the sake of convenience. For any matrix $A \in \mathbb{C}^{D \times N}$ with SVD $A = U \Sigma V^*$, we will let $\overline{A} := U\Sigma = AV \in \mathbb{C}^{D \times N}$.² Given this notation Lemma 2 can be rephrased as follows:

²It is important to note that $\overline{A}$ is not necessarily uniquely determined by $A$ if, e.g., $A$ is rank deficient and/or has repeated singular values. In these types of cases many pairs of unitary $U$ and
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Fig.  1.  Flowchart  for  a  simple  (one-level)  distributed  parallel  SVD  algorithm.  The  different 
colors  represent  different  processors  completing  operations  in  parallel. 


Algorithm 1 A $q$-level, distributed SVD Algorithm for Highly Rectangular $A \in \mathbb{C}^{D \times N}$, $N \gg D$.

Input: $q$ (# levels),
  $n$ (# local SVDs to concatenate at each level),
  $d \in \{1, \ldots, D\}$ (intrinsic dimension),
  $A^{1,i} := A^i \in \mathbb{C}^{D \times N_i}$ for $i = 1, 2, \ldots, M$ (block decomposition of $A$; algorithm assumes $M = n^q$, generalization is trivial)
Output: $U' \in \mathbb{C}^{D \times d} \approx$ the first $d$ columns of $U$, and $\Sigma' \in \mathbb{R}^{d \times d} \approx (\Sigma)_d$.
1: for $p = 1, \ldots, q$ do
2:   Compute (in parallel) the SVDs of $A^{p,i} = U^{p,i} \Sigma^{p,i} (V^{p,i})^*$ for $i = 1, 2, \ldots, M/n^{(p-1)}$, unless the $U^{p,i} \Sigma^{p,i}$ are already available from a previous run.
3:   Set $A^{p+1,i} := \left[ \left(U^{p,(i-1)n+1} \Sigma^{p,(i-1)n+1}\right)_d \,\middle|\, \cdots \,\middle|\, \left(U^{p,in} \Sigma^{p,in}\right)_d \right]$ for $i = 1, 2, \ldots, M/n^p$.
4: end for
5: Compute the SVD of $A^{q+1,1}$
6: Set $U' :=$ the first $d$ columns of $U^{q+1,1}$, and $\Sigma' := (\Sigma^{q+1,1})_d$.

Corollary 1. Suppose that $A \in \mathbb{C}^{D \times N}$ has rank $d \in \{1, \ldots, D\}$, and let $A^i \in \mathbb{C}^{D \times N_i}$, $i = 1, 2, \ldots, M$ be the block decomposition of $A$, i.e., $A = [A^1 | A^2 | \cdots | A^M]$. Since $A^i$ has rank at most $d$ for all $i = 1, 2, \ldots, M$, we have that

$$\overline{(A)_d} = \overline{A} = \overline{\left[ A^1 \,|\, A^2 \,|\, \cdots \,|\, A^M \right]} = \overline{\left[ \overline{(A^1)_d} \,\middle|\, \overline{(A^2)_d} \,\middle|\, \cdots \,\middle|\, \overline{(A^M)_d} \right]}.$$

$V$ may appear in a valid SVD of $A$. Below, one can consider $\overline{A}$ to be $AV$ for any such valid unitary matrix $V$. Similarly, one can always consider statements of the form $\overline{A} = \overline{B}$ as meaning that $A$ and $B$ are equivalent up to multiplication by a unitary matrix on the right. This inherent ambiguity does not affect the results below in a meaningful way.
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Fig.  2.  Flowchart  for  a  two-level  hierarchical  parallel  SVD  algorithm.  The  different  colors 
represent  different  processors  completing  operations  in  parallel. 


We  can  now  prove  that  Algorithm  1  is  guaranteed  to  recover  A  when  the  rank  of 
177  A  is  known.  The  proof  follows  by  inductively  applying  Corollary  1. 

Theorem  1.  Suppose  that  A  £  i^dxn  rani;  ^  g  {1,  Then,  Algo- 

179  rithm  1  is  guaranteed  to  recover  an  A9+1,1  £  CDxN  such  that  A.9+1’1  =  A. 

Proof. We prove the theorem by induction on the level p. To establish the base case we note that
\[
A = \left[ (A^{1})_d \mid (A^{2})_d \mid \cdots \mid (A^{M})_d \right] = \left[ A^{1,1} \mid A^{1,2} \mid \cdots \mid A^{1,M} \right]
\]
holds by Corollary 1. Now, for the purpose of induction, suppose that
\[
A = \left[ (A^{p,1})_d \mid (A^{p,2})_d \mid \cdots \mid (A^{p,M/n^{(p-1)}})_d \right] = \left[ A^{p,1} \mid A^{p,2} \mid \cdots \mid A^{p,M/n^{(p-1)}} \right]
\]
holds for some $p \in \{1, \dots, q\}$. Then, we can use the induction hypothesis and repartition the blocks of A to see that
\begin{align*}
A &= \left[ (A^{p,1})_d \mid (A^{p,2})_d \mid \cdots \mid (A^{p,M/n^{(p-1)}})_d \right] \\
  &= \left[ \cdots \mid \left[ (A^{p,(i-1)n+1})_d \cdots (A^{p,in})_d \right] \mid \cdots \right], \quad i = 1, \dots, M/n^{p}, \\
  &= \left[ A^{p+1,1} \mid A^{p+1,2} \mid \cdots \mid A^{p+1,M/n^{p}} \right], \tag{3}
\end{align*}
where we have utilized the definition of $A^{p+1,i}$ from line 3 of Algorithm 1 to get (3). Applying Corollary 1 to the matrix in (3) now yields
\[
A = \left[ A^{p+1,1} \mid A^{p+1,2} \mid \cdots \mid A^{p+1,M/n^{p}} \right].
\]
Finally, we finish by noting that each $A^{p+1,i}$ will have rank at most d since A is of rank d. Hence, we will also have
\[
A = \left[ (A^{p+1,1})_d \mid (A^{p+1,2})_d \mid \cdots \mid (A^{p+1,M/n^{p}})_d \right],
\]
finishing the proof. □
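The exact-recovery claim of Theorem 1 can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes that each level keeps only the d leading left singular vectors scaled by their singular values, and that n neighbouring proxies are concatenated before the next SVD. The function names `scaled_left_vectors` and `hierarchical_svd` are hypothetical.

```python
import numpy as np

def scaled_left_vectors(block, d):
    """Return U_d * diag(s_d): the d leading left singular vectors of `block`,
    scaled by their singular values (right singular vectors are discarded)."""
    U, s, _ = np.linalg.svd(block, full_matrices=False)
    return U[:, :d] * s[:d]

def hierarchical_svd(A, n, q, d):
    """Sketch of a q-level merge over M = n**q column blocks of A.
    Assumes A has at least n**q columns."""
    # Level 1: split A into M = n**q column blocks and compress each one.
    level = [scaled_left_vectors(b, d) for b in np.array_split(A, n**q, axis=1)]
    # Each of the q merge levels concatenates n neighbours and compresses the result.
    for _ in range(q):
        level = [scaled_left_vectors(np.hstack(level[i:i + n]), d)
                 for i in range(0, len(level), n)]
    (proxy,) = level                      # exactly one D x d proxy matrix remains
    U, s, _ = np.linalg.svd(proxy, full_matrices=False)
    return U, s
```

For a matrix of rank d, the returned singular values and the span of the returned left singular vectors should agree with those of a direct SVD of A up to floating point error.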


Our next objective is to understand the accuracy of Algorithm 1 when it is called
with a value of d that is less than the rank of A. To begin we need a more general version
of Lemma 2.


Lemma 3. Suppose $A^i \in \mathbb{C}^{D \times N_i}$, $i = 1, 2, \dots, M$. Further, suppose matrix A has block components $A = \left[ A^1 \mid A^2 \mid \cdots \mid A^M \right]$, and B has block components $B = \left[ (A^1)_d \mid (A^2)_d \mid \cdots \mid (A^M)_d \right]$. Then, $\|(B)_d - A\|_F \le \|(B)_d - B\|_F + \|B - A\|_F \le 3\|(A)_d - A\|_F$ holds for all $d \in \{1, \dots, D\}$.

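Proof. We have that
\begin{align*}
\|(B)_d - A\|_F &\le \|(B)_d - B\|_F + \|B - A\|_F \\
&\le \|(A)_d - B\|_F + \|B - A\|_F \\
&\le \|(A)_d - A\|_F + 2\|B - A\|_F .
\end{align*}
Now letting $(A)_d^i \in \mathbb{C}^{D \times N_i}$, $i = 1, 2, \dots, M$, denote the ith block of $(A)_d$, we can see that
\begin{align*}
\|B - A\|_F^2 &= \sum_{i=1}^{M} \|(A^i)_d - A^i\|_F^2 \\
&\le \sum_{i=1}^{M} \|(A)_d^i - A^i\|_F^2 \\
&= \|(A)_d - A\|_F^2 .
\end{align*}
Combining these two estimates now proves the desired result. □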
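As a sanity check, the inequality of Lemma 3 can be evaluated on a random full-rank matrix. This is only an illustrative sketch; the helper names `best_rank` and `lemma3_terms` are hypothetical, and the block partition via `np.array_split` is an assumption about how the columns are divided.

```python
import numpy as np

def best_rank(X, d):
    """Best rank-d approximation (X)_d in the Frobenius norm via a truncated SVD."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :d] * s[:d]) @ Vh[:d]

def lemma3_terms(A, num_blocks, d):
    """Return (||(B)_d - A||_F, 3 ||(A)_d - A||_F) for B = [(A^1)_d | ... | (A^M)_d]."""
    blocks = np.array_split(A, num_blocks, axis=1)
    B = np.hstack([best_rank(b, d) for b in blocks])
    lhs = np.linalg.norm(best_rank(B, d) - A)        # Frobenius norm by default
    rhs = 3.0 * np.linalg.norm(best_rank(A, d) - A)
    return lhs, rhs

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 200))    # full rank, so truncation discards information
lhs, rhs = lemma3_terms(A, num_blocks=8, d=5)
print(lhs, rhs)                       # Lemma 3 predicts lhs <= rhs
```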

We can now use Lemma 3 to prove a theorem that will help us to bound
the error produced by Algorithm 1 when d is chosen to be less than the rank
of A. It improves over Lemma 3 (in our setting) by not implicitly assuming to have
access to any information regarding the right singular vectors of the blocks of A. It
also demonstrates that the proposed method is stable with respect to additive errors
by allowing (e.g., roundoff) errors, represented by $\Psi$, to corrupt the original matrix
entries. Note that Theorem 2 is a strict generalization of Corollary 1. Corollary 1 is
recovered from it when $\Psi$ is chosen to be the zero matrix, and d is chosen to be the
rank of A.


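Theorem 2. Suppose that $A \in \mathbb{C}^{D \times N}$ has block components $A^i \in \mathbb{C}^{D \times N_i}$, $i = 1, 2, \dots, M$, so that $A = \left[ A^1 \mid A^2 \mid \cdots \mid A^M \right]$. Let $B = \left[ (A^1)_d \mid (A^2)_d \mid \cdots \mid (A^M)_d \right]$, $\Psi \in \mathbb{C}^{D \times N}$, and $B' = B + \Psi$. Then, there exists a unitary matrix W such that
\[
\|(B')_d - AW\|_F \le 3\sqrt{2}\, \|(A)_d - A\|_F + \left(1 + \sqrt{2}\right) \|\Psi\|_F
\]
holds for all $d \in \{1, \dots, D\}$.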



Proof. Let $A' = \left[ A^1 \mid A^2 \mid \cdots \mid A^M \right]$. Note that $A' = A$ by Corollary 1. Thus, there exists a unitary matrix $W''$ such that $A' = AW''$. Using this fact in combination with the unitary invariance of the Frobenius norm, one can now see that
\[
\|(B')_d - A'\|_F = \|(B')_d - AW''\|_F = \|(B')_d - AW'\|_F = \|(B')_d - AW\|_F
\]
for some unitary matrices $W'$ and $W$. Hence, it suffices to bound $\|(B')_d - A'\|_F$. Proceeding with this goal in mind we can see that
\begin{align*}
\|(B')_d - A'\|_F &\le \|(B')_d - B'\|_F + \|B' - B\|_F + \|B - A'\|_F \\
&= \sqrt{\sum_{j=d+1}^{D} \sigma_j^2(B + \Psi)} + \|\Psi\|_F + \|B - A'\|_F \\
&= \sqrt{\sum_{j=1}^{\lceil (D-d)/2 \rceil} \sigma_{d+2j-1}^2(B + \Psi) + \sigma_{d+2j}^2(B + \Psi)} + \|\Psi\|_F + \|B - A'\|_F \\
&\le \sqrt{\sum_{j=1}^{\lceil (D-d)/2 \rceil} \left( \sigma_{d+j}(B) + \sigma_{j}(\Psi) \right)^2 + \left( \sigma_{d+j}(B) + \sigma_{j+1}(\Psi) \right)^2} + \|\Psi\|_F + \|B - A'\|_F ,
\end{align*}
where the last inequality results from an application of Weyl's inequality to the first term (see, e.g., Theorem 3.3.16 in [12]). Utilizing the triangle inequality on the first term now implies that
\begin{align*}
\|(B')_d - A'\|_F &\le \sqrt{\sum_{j=d+1}^{D} 2\sigma_j^2(B)} + \sqrt{\sum_{j=1}^{D} 2\sigma_j^2(\Psi)} + \|\Psi\|_F + \|B - A'\|_F \\
&\le \sqrt{2} \left( \|(B)_d - B\|_F + \|B - A'\|_F \right) + \left(1 + \sqrt{2}\right) \|\Psi\|_F .
\end{align*}
Applying Lemma 3 to bound the first two terms now concludes the proof after noting that $\|(A')_d - A'\|_F = \|(A)_d - A\|_F$. □
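The index bookkeeping in the Weyl step of the proof of Theorem 2 may be easier to follow when written out. The restatement below is ours, and it assumes the convention that singular values with index larger than the matrix dimension are zero.

```latex
% Weyl's inequality for singular values, in the form used in the proof:
\[
\sigma_{i+j-1}(X + Y) \le \sigma_i(X) + \sigma_j(Y), \qquad i, j \ge 1 .
\]
% With X = B, Y = \Psi and first index d + j, the two choices of second index
% (j and j + 1) give the paired bounds:
\[
\sigma_{d+2j-1}(B + \Psi) \le \sigma_{d+j}(B) + \sigma_{j}(\Psi), \qquad
\sigma_{d+2j}(B + \Psi) \le \sigma_{d+j}(B) + \sigma_{j+1}(\Psi) .
\]
% The subsequent triangle-inequality step splits the vector with entries
% (\sigma_{d+j}(B) + \sigma_{j}(\Psi), \; \sigma_{d+j}(B) + \sigma_{j+1}(\Psi))
% into a part that depends only on B and a part that depends only on \Psi:
\[
\sqrt{\sum_j (a_j + b_j)^2 + (a_j + c_j)^2}
\le \sqrt{\sum_j 2 a_j^2} + \sqrt{\sum_j b_j^2 + c_j^2},
\qquad a_j = \sigma_{d+j}(B), \; b_j = \sigma_{j}(\Psi), \; c_j = \sigma_{j+1}(\Psi) .
\]
```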

This  final  theorem  bounds  the  total  error  of  Algorithm  1  with  respect  to  the  true 
matrix  A  up  to  right  multiplication  by  a  unitary  matrix.  The  structure  of  its  proof  is 
similar  to  that  of  Theorem  1. 

Theorem 3. Let $A \in \mathbb{C}^{D \times N}$ and $q \ge 1$. Then, Algorithm 1 is guaranteed to recover an $A^{q+1,1} \in \mathbb{C}^{D \times N}$ such that $(A^{q+1,1})_d = AW + \Psi$, where W is a unitary matrix, and $\|\Psi\|_F \le \left( (1 + \sqrt{2})^{q+1} - 1 \right) \|(A)_d - A\|_F$.

Proof. Within the confines of this proof we will always refer to the approximate matrix $A^{p+1,i}$ from line 3 of Algorithm 1 as
\[
B^{p+1,i} := \left[ (B^{p,(i-1)n+1})_d \mid \cdots \mid (B^{p,in})_d \right],
\]
for $p = 1, \dots, q$, and $i = 1, \dots, M/n^{p}$. Conversely, A will always refer to the original (potentially full rank) matrix with block components $A = \left[ A^1 \mid A^2 \mid \cdots \mid A^M \right]$, where $M = n^q$. Furthermore, $A^{p,i}$ will always refer to the error free block of the original matrix A whose entries correspond to the entries included in $B^{p,i}$.3 Thus, $A = \left[ A^{p,1} \mid A^{p,2} \mid \cdots \mid A^{p,M/n^{(p-1)}} \right]$ holds for all $p = 1, \dots, q+1$, where
\[
A^{p+1,i} := \left[ A^{p,(i-1)n+1} \mid \cdots \mid A^{p,in} \right]
\]
for all $p = 1, \dots, q$ and $i = 1, \dots, M/n^{p}$. For $p = 1$ we have $B^{1,i} = A^i = A^{1,i}$ for $i = 1, \dots, M$ by definition as per Algorithm 1. Our ultimate goal is to bound the renamed $(B^{q+1,1})_d$ matrix from lines 5 and 6 of Algorithm 1 with respect to the original matrix A. We will do this by induction on the level p. More specifically, we will prove that

1. $(B^{p,i})_d = A^{p,i} W^{p,i} + \Psi^{p,i}$, where
2. $W^{p,i}$ is always a unitary matrix, and
3. $\|\Psi^{p,i}\|_F \le \left( (1 + \sqrt{2})^{p} - 1 \right) \|(A^{p,i})_d - A^{p,i}\|_F$,

holds for all $p = 1, \dots, q+1$, and $i = 1, \dots, M/n^{(p-1)}$.

Note that conditions 1-3 above are satisfied for $p = 1$ since $B^{1,i} = A^i = A^{1,i}$ for all $i = 1, \dots, M$ by definition. Thus, there exist unitary $W^{1,i}$ for all $i = 1, \dots, M$ such that
\[
(B^{1,i})_d = (A^{1,i})_d = (A^{1,i})_d W^{1,i} = A^{1,i} W^{1,i} + \left( (A^{1,i})_d - A^{1,i} \right) W^{1,i},
\]
where $\Psi^{1,i} := \left( (A^{1,i})_d - A^{1,i} \right) W^{1,i}$ has $\|\Psi^{1,i}\|_F = \|(A^{1,i})_d - A^{1,i}\|_F \le \sqrt{2}\, \|(A^{1,i})_d - A^{1,i}\|_F$.

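Now suppose that conditions 1-3 hold for some $p \in \{1, \dots, q\}$. Then, one can see from condition 1 that
\begin{align*}
B^{p+1,i} &:= \left[ (B^{p,(i-1)n+1})_d \mid \cdots \mid (B^{p,in})_d \right] \\
&= \left[ A^{p,(i-1)n+1} W^{p,(i-1)n+1} + \Psi^{p,(i-1)n+1} \mid \cdots \mid A^{p,in} W^{p,in} + \Psi^{p,in} \right] \\
&= \left[ A^{p,(i-1)n+1} W^{p,(i-1)n+1} \mid \cdots \mid A^{p,in} W^{p,in} \right] + \left[ \Psi^{p,(i-1)n+1} \mid \cdots \mid \Psi^{p,in} \right] \\
&= \left[ A^{p,(i-1)n+1} \mid \cdots \mid A^{p,in} \right] \tilde{W} + \tilde{\Psi},
\end{align*}
where $\tilde{\Psi} := \left[ \Psi^{p,(i-1)n+1} \mid \cdots \mid \Psi^{p,in} \right]$, and
\[
\tilde{W} := \begin{pmatrix}
W^{p,(i-1)n+1} & 0 & \cdots & 0 \\
0 & W^{p,(i-1)n+2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & W^{p,in}
\end{pmatrix}.
\]
Note that $\tilde{W}$ is unitary since its diagonal blocks are all unitary by condition 2. Therefore, we have $B^{p+1,i} = A^{p+1,i} \tilde{W} + \tilde{\Psi}$.

We may now bound $\|(B^{p+1,i})_d - A^{p+1,i} \tilde{W}\|_F$ using a similar argument to that employed in the proof of Theorem 2:
\begin{align*}
\|(B^{p+1,i})_d - A^{p+1,i} \tilde{W}\|_F &\le \|(B^{p+1,i})_d - B^{p+1,i}\|_F + \|B^{p+1,i} - A^{p+1,i} \tilde{W}\|_F \\
&= \sqrt{\sum_{j=d+1}^{D} \sigma_j^2\left( A^{p+1,i} \tilde{W} + \tilde{\Psi} \right)} + \|\tilde{\Psi}\|_F \\
&\le \sqrt{\sum_{j=d+1}^{D} 2\sigma_j^2\left( A^{p+1,i} \tilde{W} \right)} + \sqrt{\sum_{j=1}^{D} 2\sigma_j^2( \tilde{\Psi} )} + \|\tilde{\Psi}\|_F \\
&= \sqrt{2}\, \|A^{p+1,i} - (A^{p+1,i})_d\|_F + \left(1 + \sqrt{2}\right) \|\tilde{\Psi}\|_F . \tag{4}
\end{align*}
Appealing to condition 3 in order to bound $\|\tilde{\Psi}\|_F$ we obtain
\begin{align*}
\|\tilde{\Psi}\|_F^2 = \sum_{j=1}^{n} \|\Psi^{p,(i-1)n+j}\|_F^2 &\le \left( (1 + \sqrt{2})^{p} - 1 \right)^2 \sum_{j=1}^{n} \|(A^{p,(i-1)n+j})_d - A^{p,(i-1)n+j}\|_F^2 \\
&\le \left( (1 + \sqrt{2})^{p} - 1 \right)^2 \sum_{j=1}^{n} \|(A^{p+1,i})_d^j - A^{p,(i-1)n+j}\|_F^2 ,
\end{align*}
where $(A^{p+1,i})_d^j$ denotes the block of $(A^{p+1,i})_d$ corresponding to $A^{p,(i-1)n+j}$ for $j = 1, \dots, n$. Thus, we have that
\begin{align*}
\|\tilde{\Psi}\|_F^2 &\le \left( (1 + \sqrt{2})^{p} - 1 \right)^2 \sum_{j=1}^{n} \|(A^{p+1,i})_d^j - A^{p,(i-1)n+j}\|_F^2 \\
&= \left( (1 + \sqrt{2})^{p} - 1 \right)^2 \|(A^{p+1,i})_d - A^{p+1,i}\|_F^2 . \tag{5}
\end{align*}
Combining (4) and (5) we can finally see that
\begin{align*}
\|(B^{p+1,i})_d - A^{p+1,i} \tilde{W}\|_F &\le \left[ \sqrt{2} + \left(1 + \sqrt{2}\right) \left( (1 + \sqrt{2})^{p} - 1 \right) \right] \|(A^{p+1,i})_d - A^{p+1,i}\|_F \\
&= \left( (1 + \sqrt{2})^{p+1} - 1 \right) \|(A^{p+1,i})_d - A^{p+1,i}\|_F . \tag{6}
\end{align*}
Note that $\|(B^{p+1,i})_d - A^{p+1,i} \tilde{W}\|_F = \|(B^{p+1,i})_d - A^{p+1,i} W^{p+1,i}\|_F$ where $W^{p+1,i}$ is unitary. Hence, we can see that conditions 1-3 hold for $p + 1$ with $\Psi^{p+1,i} := (B^{p+1,i})_d - A^{p+1,i} W^{p+1,i}$. □

3 That is, $B^{p,i}$ is used to approximate the singular values and left singular vectors of $A^{p,i}$ for all $p = 1, \dots, q+1$, and $i = 1, \dots, M/n^{p-1}$.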

Having proven that the method is accurate for low rank A, we are now free to
consider its computational costs.


2.4. Parallel Cost Model and Collectives. To analyze the parallel communication cost of the hierarchical SVD algorithm, the $\alpha$-$\beta$-$\gamma$ model for distributed-memory parallel computation [6] is used. The parameters $\alpha$ and $\beta$ respectively represent the latency cost and the transmission cost of sending a "word" between two processors. In our presentation, a word will refer to a vector of doubles in $\mathbb{R}^D$, i.e., a vector of size $D \times 1$. The parameter $\gamma$ represents the time for one floating point operation (FLOP).
The q-level hierarchical Algorithm 1 seeks to find the d largest singular values
and left singular vectors of a matrix A. If the matrix A is decomposed into $M = n^q$
blocks, where n is the number of local SVDs being concatenated at each level,
the send/receive communication cost for the algorithm is
\[
q \left( \alpha + d (n - 1) \beta \right),
\]
assuming that the data is already distributed on the compute nodes and no scatter
command is required. If the (distributed) right singular vectors are needed, then a
broadcast of the left singular vectors to all nodes incurs a communication cost of
$\alpha + d M \beta$.
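The two communication costs above are simple enough to tabulate directly. The helpers below are an illustrative sketch; the function names are hypothetical, and the units of `alpha` and `beta` (time per message and time per word) are whatever the reader measures on their own system.

```python
def merge_comm_cost(q, n, d, alpha, beta):
    """Send/receive cost of the q-level algorithm: q * (alpha + d * (n - 1) * beta)."""
    return q * (alpha + d * (n - 1) * beta)

def broadcast_cost(M, d, alpha, beta):
    """Cost of broadcasting d left singular vectors to M nodes: alpha + d * M * beta."""
    return alpha + d * M * beta

# The same M = 64 blocks can be merged as 2**6, 4**3, or 8**2; the model trades
# fewer levels (smaller q) against more words per level (larger n).
for n, q in [(2, 6), (4, 3), (8, 2)]:
    print(n, q, merge_comm_cost(q, n, d=100, alpha=1.0, beta=0.5))
```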

Suppose A is a $D \times N$ matrix, $N \gg D$. The sequential SVD is typically performed
in two phases: bidiagonalization (which requires $(2ND^2 + 2D^3)$ flops) followed by
diagonalization (negligible cost). If M processing cores are available to compute the
q-level hierarchical SVD method in Algorithm 1, and the matrix A is decomposed
into $M = n^q$ blocks, where n is again the number of local SVDs being concatenated
at each level, then the potential parallel speedup can be approximated by
\[
\frac{\gamma \left( 2ND^2 + 2D^3 \right)}{\gamma \left( 2(N/M)D^2 + 2D^3 \right) + q \left( 2dnD^2 + 2D^3 \right) + q \left( \alpha + d(n-1)\beta \right)} . \tag{7}
\]
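A small calculator makes it easy to explore how the predicted speedup in (7) varies with n and q. The sketch below rests on one assumption that is ours rather than the source's: the flop count of the q merge levels is also multiplied by $\gamma$, so that every term in the denominator is a time. The function name `predicted_speedup` is hypothetical.

```python
def predicted_speedup(N, D, d, n, q, alpha, beta, gamma):
    """Approximate parallel speedup of the q-level hierarchical SVD with M = n**q blocks."""
    M = n ** q
    sequential = gamma * (2 * N * D**2 + 2 * D**3)        # one SVD of the full matrix
    local_svd = gamma * (2 * (N / M) * D**2 + 2 * D**3)   # SVD of one block per core
    # Assumption: merge flops are weighted by gamma so all terms are times.
    merges = gamma * q * (2 * d * n * D**2 + 2 * D**3)
    comm = q * (alpha + d * (n - 1) * beta)
    return sequential / (local_svd + merges + comm)

# Example: the matrix dimensions of the first experiment, ignoring communication.
print(predicted_speedup(N=1_152_000, D=800, d=800, n=2, q=3,
                        alpha=0.0, beta=0.0, gamma=1.0))
```

With `q = 0` there is a single block and no merge, so the model reduces to a speedup of one, which is a quick consistency check on the formula.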

3.  Numerical  Validation.  In  the  first  experiment,  the  left  singular  vectors  and 
the singular values of a matrix A (D = d = 800, N = 1,152,000) are found. We utilize
a  shared  memory  system  which  has  6  TB  of  memory  and  eight  sockets,  each  equipped 
with  a  twelve-core  Intel  E7-8857v2  processor,  for  a  total  of  96  processing  cores.  Since 
the  input  data  and  memory  storage  required  by  the  SVD  algorithms  fit  in  memory  on 
this  specialized  compute  node,  we  performed  a  strong  scaling  study  of  our  one-level 
distributed  SVD  algorithm,  benchmarked  against  the  LAPACK  SVD  routine,  dgesvd, 
implemented  in  the  threaded  Intel  MKL  library.  In  a  pre-processing  step,  the  matrix 
A  is  decomposed  with  each  block  of  A  stored  in  separate  HDF5  files,  hosted  on  a 
high-speed  Lustre  server  capable  of  6GB/s  read/write  i/o.  The  observed  speedup 
is  reported  in  Figure  3.  In  the  blue  curve,  the  observed  speedup  is  reported  for  a 
varying  number  of  MKL  worker  threads.  In  the  red  curve,  the  speedup  is  reported 
for  a  varying  number  of  worker  threads  i,  applied  to  an  appropriate  decomposition  of 
the  matrix.  Each  worker  uses  the  same  Intel  MKL  library  to  compute  the  SVD  of  the 
decomposed  matrices  (each  using  a  single  thread),  the  proxy  matrix  is  assembled,  and 
the  master  thread  computes  the  SVD  of  the  proxy  matrix  using  the  Intel  MKL  library, 
again with a single thread. Each numerical experiment is run four times, and the
average walltime is used to compute the observed speedup. The parallel performance
of our distributed SVD is far superior, in spite of the fact that our algorithm
was implemented using MPI 2.0 and does not leverage the inter-node communication
savings that are possible with newer MPI implementations. The dip in performance
when more than 48 cores are used is likely attributable to non-uniform memory access
on this large shared-memory node.

In the second experiment, we perform a weak scaling study, where the size of
the input matrix A is varied depending on the number of worker nodes: A is of size
2000 x 32,000M, where M is the number of compute cores. The experiment was conducted
on  a  shared  high-performance  cluster  (other  users  may  be  running  computationally 
intensive  processes  on  the  same  node,  communication  heavy  processes  on  the  network, 
or  i/o  heavy  processes  taxing  the  shared  file  systems),  leading  to  some  variability  in 
the study. Each data point in Figure 4 is computed using the average walltime from
five numerical experiments. Additionally, the network is constructed using a fat-tree
topology that is oversubscribed by a ratio of 2:1, resulting in further variability
based on the compute resources that were allocated for each numerical experiment.
The theoretical peak efficiency is computed using equation (7), assuming negligible
communication overhead.

[Figure 3 plot; legend: (MKL) Actual Speedup, Distributed SVD, Ideal Speedup]

Fig. 3. Strong scaling study of the dgesvd function in the threaded Intel MKL library (blue)
and the proposed distributed SVD algorithm (red). The input matrix is of size 800 x 1,152,000. The
slightly "better than ideal" speedup is likely due to better utilization of cache in each socket.


[Figure 4 plot; legend: observed efficiency and theoretical peak for n = 2, n = 3, and n = 4]

Fig. 4. Weak scaling study of the hierarchical SVD algorithm. The input matrix is of size
2000 x (32000M), where M is the number of processing cores used in the computation. The observed
efficiency is plotted for various n's (number of scaled singular vectors concatenated at each
hierarchical level). There is a slight efficiency gain when increasing n, until the communication
cost dominates, or the size of the proxy matrix becomes significantly large.
In the last experiment, we repeat the weak scaling study (where the size of the
input matrix A is varied depending on the number of worker nodes: A is of size
2000 x 32,000M, where M is the number of compute cores), but utilize a priori knowledge
that the rank of A is much less than the ambient dimension. Specifically, we construct
a data set with $d = 100 \ll 2000$. The hierarchical SVD performs more efficiently if the
intrinsic dimension of the data can be estimated a priori. There is a loss of efficiency
when more than 64 cores are utilized. This is likely attributable to the network topology
of the assigned computational resources.


[Figure 5 plot; legend: observed efficiency and theoretical peak for n = 2, n = 3, and n = 4]

Fig. 5. Weak scaling study of the hierarchical SVD algorithm applied to data with intrinsic
dimension much lower than the ambient dimension. The input matrix is of size 2000 x (32000M),
where M is the number of processing cores used in the computation. The intrinsic dimension is
$d = 100 \ll 2000$. The observed efficiency is plotted for various n's (number of scaled singular
vectors concatenated at each hierarchical level). As expected, the theoretical and observed
efficiency are better if the intrinsic dimension is known (or can be estimated) a priori.


4.  Concluding  Remarks  and  Acknowledgments.  In  this  paper,  we  show 
that  the  SVD  of  a  matrix  can  be  constructed  efficiently  in  a  hierarchical  approach. 
Our  algorithm  is  proven  to  recover  exactly  the  singular  values  and  left  singular  vectors 
if  the  rank  of  the  matrix  A  is  known.  Further,  the  hierarchical  algorithm  can  be  used 
to  recover  the  d  largest  singular  values  and  left  singular  vectors  with  bounded  error. 
We  also  show  that  the  proposed  method  is  stable  with  respect  to  roundoff  errors  or 
corruption  of  the  original  matrix  entries.  Numerical  experiments  validate  the  proposed 
algorithms  and  parallel  cost  analysis. 

Although not shown in the paper, the right singular vectors can be computed
efficiently (in parallel) if desired, once the left singular vectors and singular values are
known. The master process broadcasts the left singular vectors and singular values
to each process. Then columns of the right singular vectors can be constructed by
computing $\frac{1}{\sigma_j} (A^i)^* u_j$, where $A^i$ is the block of A residing on process i, and $(\sigma_j, u_j)$
is the jth singular value and left singular vector respectively. The authors note that
the practicality of the hierarchical algorithm is questionable for sparse input matrices,
since the assembled proxy matrices as posed will be dense. Further investigation in
this direction is required, but is beyond the scope of this paper. Lastly, the hierarchical
algorithm has a map-reduce flavor that will lend itself well to a map-reduce framework
such as Apache Hadoop [20] or Apache Spark [22].
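The block-wise construction of the right singular vectors can be sketched in a few lines of NumPy. In this illustration each "process" is simply one entry of a list of column blocks; the function name `right_vectors_for_block` is hypothetical, and the left singular vectors are taken from a direct SVD only to keep the example short.

```python
import numpy as np

def right_vectors_for_block(A_block, U, s):
    """Rows of V owned by one process: column j is (1 / s[j]) * (A^i)^* u_j."""
    return (A_block.conj().T @ U) / s

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 40))   # rank-3 example
U, s, _ = np.linalg.svd(A, full_matrices=False)
U, s = U[:, :3], s[:3]                    # keep only the nonzero singular triplets

# Each block of columns yields the matching rows of V without any communication
# beyond the initial broadcast of U and s.
blocks = np.array_split(A, 4, axis=1)
V = np.vstack([right_vectors_for_block(b, U, s) for b in blocks])
```

Stacking the per-block results gives a matrix V with orthonormal columns such that $U \Sigma V^*$ reproduces A, provided only nonzero singular values are retained.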
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In  this  short  note  we  propose  a  simple  two-stage  sparse  phase  retrieval  strategy 
that  uses  a  near-optimal  number  of  measurements,  and  is  both  computationally 
efficient  and  robust  to  measurement  noise.  In  addition,  the  proposed  strategy  is 
fairly  general,  allowing  for  a  large  number  of  new  measurement  constructions  and 
recovery  algorithms  to  be  designed  with  minimal  effort. 

©  2015  Elsevier  Inc.  All  rights  reserved. 


1.  Introduction 

Herein we consider the phase retrieval problem of reconstructing a given vector $x \in \mathbb{C}^N$ from noisy
magnitude measurements of the form
\[
b_i := |\langle p_i, x \rangle|^2 + n_i, \tag{1}
\]
where $p_i \in \mathbb{C}^N$ is a measurement vector, and $n_i \in \mathbb{R}$ represents arbitrary measurement noise, for
$i = 1, \dots, M$. In particular, we focus on the setting where the dimension N is either very large, or else
the number of measurements allowed, M, is otherwise severely restricted. In either case, our inability to
gather the $M = O(N)$ measurements required for the recovery of x in general [20] forces us to consider
the possibility of approximating x using only $M \ll N$ magnitude measurements, if possible. This is the
situation motivating the compressive phase retrieval problem (see, e.g., [30,31,26,24,34,15,32,35]), in which
one attempts to accurately approximate $x \in \mathbb{C}^N$ using only $M = o(N)$ magnitude measurements (1) under
the assumption that x is either sparse, or compressible.
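The measurement model (1) is easy to simulate. The sketch below assumes the convention that row i of a matrix `P` holds $p_i^*$, so that `P @ x` collects the inner products; that convention and the function name `magnitude_measurements` are ours. It also illustrates why x can only be determined up to a global phase factor: multiplying x by a unit-modulus scalar leaves every measurement unchanged.

```python
import numpy as np

def magnitude_measurements(P, x, noise):
    """b_i = |<p_i, x>|^2 + n_i, with row i of P holding p_i^* (assumed convention)."""
    return np.abs(P @ x) ** 2 + noise

rng = np.random.default_rng(0)
M, N = 8, 5
P = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

b = magnitude_measurements(P, x, np.zeros(M))
b_rotated = magnitude_measurements(P, np.exp(0.3j) * x, np.zeros(M))
print(np.allclose(b, b_rotated))   # the global phase is invisible to the measurements
```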
One question regarding the compressive phase retrieval problem is how many measurements are needed to
allow for stable reconstruction of x. Clearly, compressive phase retrieval requires at least as many
measurements as the corresponding classical compressive sensing problem since one is given less information. Hence,
stable compressive phase retrieval requires at least $O(s \log(N/s))$ magnitude measurements3 - but can it be
done with $M = O(s \log(N/s))$ measurements? It is shown in [15] that stable compressive phase retrieval is
indeed achievable with $M = O(s \log(N/s))$ measurements for real x if the entries of $p_i$ are real independent
and identically distributed (i.i.d.) Gaussians. However, this question was unresolved in the complex case. In
this note we extend the result to the complex case. Furthermore, we do so in a constructive way by providing
a computational procedure which can stably reconstruct complex x using only $O(s \log(N/s))$ magnitude
measurements.

Unlike previous sparse phase retrieval approaches, we propose a generic two-stage solution technique
consisting of (i) using the phase retrieval technique of one's choice to recover compressive sensing
measurements of x, $Cx \in \mathbb{C}^m$, followed by (ii) utilizing the compressive sensing method of one's choice in order to
approximate x from the recovered measurements Cx. As we shall see, the generic nature of the proposed
sparse phase retrieval procedure not only allows for a relatively large number of measurement matrices and
recovery algorithms to be used, but also allows robust recovery guarantees for the sparse phase retrieval
problem to be proven in the complex setting essentially "for free" by combining existing robust recovery
results from the compressive sensing literature with robust recovery results for the standard phase retrieval
setting. As a result, we are able to show that $O(s \log(N/s))$ magnitude measurements suffice in order to
recover a large class of compressible vectors with the same quality of error guarantee as commonly achieved in
the compressive sensing literature. Finally, numerical experiments demonstrate that the proposed approach
is also both efficient and robust in practice.

2.  Background 

In this section we briefly recall selected results from the existing literature on compressive sensing [14,17]
and phase retrieval [3,2,12,11,1,16]. Let $\|x\|_0$ denote the number of nonzero entries in a given $x \in \mathbb{C}^N$, and
$\|x\|_p$ denote the standard $\ell_p$-norm of x for all $p \ge 1$, i.e., $\|x\|_p := \left( \sum_{n=1}^{N} |x_n|^p \right)^{1/p}$ for all $x \in \mathbb{C}^N$.

2.1.  Compressive  sensing 

Compressive sensing methods deal with the construction of an $m \times N$ measurement matrix, C, with
m minimized as much as possible subject to the constraint that an associated approximation algorithm,
$A_C : \mathbb{C}^m \to \mathbb{C}^N$, can still accurately approximate any given vector $x \in \mathbb{C}^N$. More precisely, compressive
sensing methods allow one to minimize m, the number of rows in C, as a function of s and N such that
\[
\| A_C(Cx) - x \|_p \le C_{p,q} \cdot \left( \inf_{z \in \mathbb{C}^N,\, \|z\|_0 \le s} \|x - z\|_q \right) \tag{2}
\]
holds for all $x \in \mathbb{C}^N$ in various fixed $\ell_p, \ell_q$ norms, $1 \le q \le p \le 2$, for an absolute constant $C_{p,q} \in \mathbb{R}$ (e.g., see
[13,17]). Note that this implies that x will be recovered exactly if it contains only s nonzero entries. Similarly,
x will be accurately approximated by $A_C(Cx)$ any time its $\ell_q$-norm is dominated by its largest s entries.
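The infimum on the right-hand side of (2) has a closed form: it is attained by keeping the s largest-magnitude entries of x, so it equals the $\ell_q$-norm of the remaining entries. A minimal sketch follows; the function name `best_s_term_error` is hypothetical.

```python
import numpy as np

def best_s_term_error(x, s, q=1):
    """inf over s-sparse z of ||x - z||_q, i.e. the l_q norm of all but the
    s largest-magnitude entries of x."""
    magnitudes = np.sort(np.abs(x))              # ascending order
    tail = magnitudes[: max(len(x) - s, 0)]      # everything except the s largest
    return np.linalg.norm(tail, ord=q)

x = np.array([3, -1, 0.5, 0, 4j])
print(best_s_term_error(x, s=2, q=1))   # error from the entries -1, 0.5 and 0
print(best_s_term_error(x, s=4, q=1))   # x has four nonzeros, so the error vanishes
```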

There are a wide variety of measurement matrices $C \in \mathbb{C}^{m \times N}$ with $m = O(s \log(N/s))$ that have
associated approximation algorithms, $A_C$, which are computationally efficient, numerically robust, and able to
achieve error guarantees of the form (2) for all $x \in \mathbb{C}^N$. For example, this is true of "most" random matrices
$C \in \mathbb{C}^{m \times N}$ with i.i.d. subgaussian random entries [4,17]. Similarly, one may construct such a $C \in \mathbb{C}^{m \times N}$ with
high probability by selecting a set of $m = O(s \log^4 N)$ rows uniformly at random from an $N \times N$ discrete
Fourier transform matrix (or, more generally, from any "sufficiently flat" $N \times N$ unitary matrix) [17]. In
either case, one may then use a large number of approximation algorithms, $A_C$, that will achieve error
guarantees along the lines of (2), including convex optimization techniques [8-10], iterative hard thresholding
[7], (regularized) orthogonal matching pursuit [33,25,28,29], and the CoSaMP algorithm [27], to name just
a few.

3 See, e.g., Chapter 10 of [17] concerning the minimal number of measurements required for stable compressive sensing.
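To make one of the listed recovery algorithms concrete, here is a compact sketch of plain (unregularized) orthogonal matching pursuit in NumPy. It is a textbook variant written for illustration, not the specific algorithms of the cited references, and the function name `omp` is ours.

```python
import numpy as np

def omp(C, y, s):
    """Orthogonal matching pursuit: greedily select s columns of C, refitting
    the coefficients on the selected columns by least squares at every step."""
    support = []
    residual = y.copy()
    coef = np.zeros(0, dtype=complex)
    for _ in range(s):
        correlations = np.abs(C.conj().T @ residual)
        if support:
            correlations[support] = 0.0          # never pick the same column twice
        support.append(int(np.argmax(correlations)))
        coef, *_ = np.linalg.lstsq(C[:, support], y, rcond=None)
        residual = y - C[:, support] @ coef
    x_hat = np.zeros(C.shape[1], dtype=complex)
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(0)
m, N = 128, 256
C = rng.standard_normal((m, N)) / np.sqrt(m)     # i.i.d. Gaussian measurement matrix
x = np.zeros(N)
x[[5, 120, 200]] = [2.0, -1.5, 1.0]              # a 3-sparse test vector
x_hat = omp(C, C @ x, s=3)
```

In this noise-free setting with many more measurements than nonzeros, the sparse vector is typically recovered to machine precision.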

More generally, any matrix with the robust null space property [13] will have an associated approximation
algorithm that is both computationally efficient and numerically robust. Let $S \subset \{1, 2, \dots, N\}$, and $x \in \mathbb{C}^N$.
Then, $x_S$ will denote x with all entries not in S set to zero. That is,
\[
(x_S)_j := \begin{cases} 0, & \text{if } j \notin S, \\ x_j, & \text{if } j \in S. \end{cases}
\]
The robust null space property can now be defined as follows.


Definition 1. Let $s, m, N \in \mathbb{N}$ be such that $s < m < N$. We will say that the matrix $C \in \mathbb{C}^{m \times N}$ satisfies the
$\ell_2$-robust null space property of order s with constants $0 < \rho < 1$ and $\tau > 0$ if
\[
\|x_S\|_2 \le \frac{\rho}{\sqrt{s}} \|x_{S^c}\|_1 + \tau \|Cx\|_2
\]
holds for all $x \in \mathbb{C}^N$ and $S \subset \{1, 2, \dots, N\}$ with cardinality $|S| \le s$, where $S^c$ denotes the complement of S.

In particular, the following robust compressive sensing result for matrices with the null space property
is a restatement of Theorem 4.22 from [17].

Theorem 1. Suppose that the matrix $C \in \mathbb{C}^{m \times N}$ satisfies the $\ell_2$-robust null space property of order s with
constants $0 < \rho < 1$ and $\tau > 0$. Then, for any $x \in \mathbb{C}^N$, the vector
\[
\hat{x} := \operatorname*{arg\,min}_{z \in \mathbb{C}^N} \|z\|_1 \quad \text{subject to} \quad \|Cz - y\|_2 \le \eta, \tag{3}
\]
where $y := Cx + e$ for some $e \in \mathbb{C}^m$ with $\|e\|_2 \le \eta$, will satisfy
\[
\|\hat{x} - x\|_2 \le \frac{C}{\sqrt{s}} \cdot \left( \inf_{z \in \mathbb{C}^N,\, \|z\|_0 \le s} \|x - z\|_1 \right) + D\eta \tag{4}
\]
for some constants $C, D \in \mathbb{R}^+$ that only depend on $\rho$ and $\tau$.

Many matrices exist with the $\ell_2$-robust null space property including, e.g., "most" randomly constructed
subgaussian and subsampled discrete Fourier transform matrices (as per above). Thus, in some sense it is
not difficult to find a matrix $C \in \mathbb{C}^{m \times N}$ to which Theorem 1 will apply. Furthermore, $\hat{x}$ from (3) can be
computed efficiently via convex optimization techniques. See [17] for details.


2.2.  Phase  retrieval 


Noisy phase retrieval problems involve the reconstruction of a given vector $x \in \mathbb{C}^N$, up to a global phase
factor, from magnitude measurements of the form
\[
b_i := |\langle p_i, x \rangle|^2 + n_i, \tag{5}
\]
where $p_i \in \mathbb{C}^N$ and $n_i \in \mathbb{R}$ for $i = 1, \dots, M$. Vectorizing (5) yields
\[
b := |\mathcal{P}x|^2 + n, \tag{6}
\]
where $b, n \in \mathbb{R}^M$, $\mathcal{P} \in \mathbb{C}^{M \times N}$, and $|\cdot|^2 : \mathbb{C}^M \to \mathbb{R}^M$ computes the component-wise squared magnitude
of each vector entry. Thus, the primary objective of phase retrieval is to construct a recovery algorithm,
$\Phi_{\mathcal{P}} : \mathbb{R}^M \to \mathbb{C}^N$, that satisfies a relative error guarantee such as, e.g.,


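\[
\min_{\theta \in [0, 2\pi]} \left\| \Phi_{\mathcal{P}}(b) - e^{i\theta} x \right\|_2 \le C_{\mathcal{P}} \cdot \frac{\|n\|_q}{M^{1/q}\, \|x\|_2} \tag{7}
\]
for a particular measurement matrix $\mathcal{P} \in \mathbb{C}^{M \times N}$, $q \in [1, 2]$, and approximation factor $C_{\mathcal{P}} \in \mathbb{R}^+$ (which may
depend on $\mathcal{P}$).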
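The minimum over the global phase on the left-hand side of (7) does not require a search: the optimal phase is the phase of the inner product between x and the estimate. The helper below is a small sketch of that computation; the name `phase_aligned_error` is hypothetical.

```python
import numpy as np

def phase_aligned_error(x_hat, x):
    """min over theta of ||x_hat - exp(i*theta) * x||_2.
    The minimizer satisfies exp(i*theta) = <x, x_hat> / |<x, x_hat>|."""
    inner = np.vdot(x, x_hat)                    # sum(conj(x) * x_hat)
    phase = inner / abs(inner) if abs(inner) > 0 else 1.0
    return np.linalg.norm(x_hat - phase * x)

rng = np.random.default_rng(0)
x = rng.standard_normal(6) + 1j * rng.standard_normal(6)
print(phase_aligned_error(np.exp(0.7j) * x, x))  # essentially zero: same vector up to phase
```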

Several recovery algorithms achieve error guarantees along the lines of (7) while using at most $M =
O(N \log N)$ measurements, including both PhaseLift [12,11] as well as a more recent graph-theoretic and
frame-based approach [1]. In particular, the following robust phase retrieval result is a variant of Theorem 1.3
from [11].4


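Theorem 2. Let $\mathcal{P} \in \mathbb{C}^{M \times N}$ have its M rows be independently drawn either uniformly at random from the
sphere of radius $\sqrt{N}$ in $\mathbb{C}^N$, or else as complex normal random vectors from $\mathcal{N}(0, I_N/2) + i\,\mathcal{N}(0, I_N/2)$.
Then, there exist universal constants $B, C, D \in \mathbb{R}^+$ such that the PhaseLift procedure $\Phi_{\mathcal{P}} : \mathbb{R}^M \to \mathbb{C}^N$ satisfies
\[
\min_{\theta \in [0, 2\pi]} \left\| \Phi_{\mathcal{P}}(b) - e^{i\theta} x \right\|_2 \le C \cdot \frac{\|n\|_1}{M \|x\|_2} \tag{8}
\]
for all $x \in \mathbb{C}^N$ with probability $1 - O(e^{-BM})$, provided that $M \ge DN$. Here $b, n \in \mathbb{R}^M$ are as in (6).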


Finally,  it  is  important  to  note  that  the  PhaseLift  procedure  from  Theorem  2  can  be  computed  via 
semidefinite  programming  techniques.  Thus,  it  is  computationally  tractable  for  modest  dimensions,  N.  See 
[12,11]  for  details. 


3.  A  simple  two-stage  technique  for  sparse  phase  retrieval 


In this section we consider using noisy magnitude measurements of the form
\[
b := |\mathcal{P}Cx|^2 + n, \tag{9}
\]
where $\mathcal{P} \in \mathbb{C}^{\tilde{m} \times m}$ is any phase retrieval matrix with an associated recovery algorithm $\Phi_{\mathcal{P}} : \mathbb{R}^{\tilde{m}} \to \mathbb{C}^m$ that
has an error guarantee along the lines of (7), and $C \in \mathbb{C}^{m \times N}$ is any compressive sensing matrix with an
associated approximation algorithm $A_C : \mathbb{C}^m \to \mathbb{C}^N$ that has an error guarantee like (2). In this situation
the composition of the two recovery algorithms, $A_C \circ \Phi_{\mathcal{P}} : \mathbb{R}^{\tilde{m}} \to \mathbb{C}^N$, should accurately approximate
$x \in \mathbb{C}^N$, up to a global phase factor, from b whenever x is sufficiently sparse or compressible. This leads us
to the following intuitive observation.

Proposition 1. Let $A = \mathcal{P}C$ where $C \in \mathbb{C}^{m \times N}$ has the robust null space property and $\mathcal{P} \in \mathbb{C}^{\tilde{m} \times m}$ is a stable
phase retrieval matrix. Then, A has the stable compressive phase retrieval property.

More specifically, the following compressive phase retrieval result follows easily from Theorems 1 and 2.

4 Equation (1.8) in Theorem 1.3 is technically incorrect as stated in [11]. See [22] for a corrected and simplified proof of Theorem 2
as stated herein.
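Theorem 3. Let $\mathcal{P} \in \mathbb{C}^{\tilde{m} \times m}$ have its $\tilde{m}$ rows be independently drawn either uniformly at random from the
sphere of radius $\sqrt{m}$ in $\mathbb{C}^m$, or else as complex normal random vectors from $\mathcal{N}(0, I_m/2) + i\,\mathcal{N}(0, I_m/2)$.
Furthermore, suppose that $C \in \mathbb{C}^{m \times N}$ satisfies the $\ell_2$-robust null space property of order s with constants
$0 < \rho < 1$ and $\tau > 0$. Then, there exists a phase retrieval procedure, $\Phi_{\mathcal{P}} : \mathbb{R}^{\tilde{m}} \to \mathbb{C}^m$, and a compressive
sensing recovery algorithm, $A_C : \mathbb{C}^m \to \mathbb{C}^N$, such that
\[
\min_{\theta \in [0, 2\pi]} \left\| e^{i\theta} x - A_C\left( \Phi_{\mathcal{P}}(b) \right) \right\|_2 \le \frac{C}{\sqrt{s}} \cdot \left( \inf_{z \in \mathbb{C}^N,\, \|z\|_0 \le s} \|x - z\|_1 \right) + D \cdot \frac{\|n\|_1}{\tilde{m}\, \|Cx\|_2} \tag{10}
\]
holds for all $x \in \mathbb{C}^N$ with probability $1 - O(e^{-B\tilde{m}})$, provided that $\tilde{m} \ge E \cdot m$. Here $b, n \in \mathbb{R}^{\tilde{m}}$ are as in (9),
and $B, E \in \mathbb{R}^+$ are universal constants, while $C, D \in \mathbb{R}^+$ are constants that only depend on $\rho$ and $\tau$.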

Considering the number of magnitude measurements required by Theorem 3, we note that $\tilde{m} = O(s \log(N/s))$ such measurements will suffice to achieve (10) for all $x \in \mathbb{C}^N$ with high probability whenever $C \in \mathbb{C}^{m \times N}$ is, e.g., a random matrix with i.i.d. subgaussian random entries. In this situation $C$ will also likely have both (i) the $\ell_2$-robust null space property of order $s$ with constants $0 < \rho < 1$ and $\tau > 0$, and (ii) a small restricted isometry constant of order $2s$, $\delta_{2s} < 1$ (see, e.g., §6.2 and §9.1 of [17] for details). As a consequence, $C$ will also satisfy


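$$ \frac{1}{\tau} \cdot \max_{S \subset \{1, \dots, N\},\ |S| = s} \left( \|x_S\|_2 - \frac{\rho}{\sqrt{s}} \|x_{\bar{S}}\|_1 \right) \le \|Cx\|_2 \tag{11} $$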


for all $x \in \mathbb{C}^N$ with high probability (w.h.p.).⁵ Considering the error guarantee (10) of Theorem 3 in light of (11), we can now see that Theorem 3 implies that all sufficiently compressible vectors with, e.g.,


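$$ \frac{1}{\sqrt{\tilde{m}}} \le \frac{1}{\tau} \cdot \max_{S \subset \{1, \dots, N\},\ |S| = s} \left( \|x_S\|_2 - \frac{\rho}{\sqrt{s}} \|x_{\bar{S}}\|_1 \right) \tag{12} $$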


will  also  satisfy 


$$ \min_{\theta \in [0, 2\pi]} \left\| e^{i\theta} x - \Delta_C\left(\Phi_P(b)\right) \right\|_2 \le \frac{C}{\sqrt{s}} \left( \inf_{z \in \mathbb{C}^N,\ \|z\|_0 \le s} \|x - z\|_1 \right) + D \|n\|_2 \tag{13} $$

w.h.p.  whenever  C  is  a  random  matrix  with  i.i.d.  subgaussian  entries. 

Finally, it is interesting to note that the two-stage approach outlined in this section also confers some computational advantages. Mainly, the phase retrieval recovery algorithm $\Phi_P : \mathbb{R}^{\tilde{m}} \to \mathbb{C}^m$ only needs to recover a vector of length $m = O(s \log(N/s))$. This allows phase retrieval approaches based on, e.g., semidefinite programming to efficiently approximate significantly larger vectors $x \in \mathbb{C}^N$ than otherwise possible when $N \gg s$.


4.  Empirical  evaluation 

We  now  present  representative  results  demonstrating  the  numerical  robustness  and  efficiency  of  the 
proposed  two-step  strategy.  For  the  results  in  this  section,  we  use  PhaseLift  [12,11]  and  Basis  Pursuit  [9]  to 
solve  the  phase  retrieval  and  compressive  sensing  problems  in  steps  (i)  and  (ii),  respectively.  Moreover,  we 
use  complex  Gaussian  phase  retrieval  matrices  V  and  real  Gaussian  compressive  sensing  matrices  C.  Matlab 
code  used  to  generate  the  numerical  results  -  implemented  using  the  optimization  software  packages  TFOCS 
[6,5]  and  CVX  [19,18]  -  is  freely  available  at  [23]. 


5 The lower bound is a simple consequence of Definition 1. For the upper bound see, e.g., Exercise 6.6 in [17].

Fig. 1. Robustness to additive noise: $N = 1024$, $s = 5$, $\tilde{m} = \lceil 14 s \log(N/s) \rceil$.


In each of the following results, we recover sparse, unit-norm complex vectors whose non-zero indices are independently and randomly chosen, and whose non-zero entries are i.i.d. standard complex Gaussians.

Fig. 1 illustrates the robustness of the recovery procedure to additive noise. We add i.i.d. zero-mean Gaussian noise at several signal-to-noise ratios (SNRs) to $\tilde{m} = \lceil 14 s \log(N/s) \rceil$ magnitude measurements ($N = 1024$, $s = 5$, $\tilde{m} = 371$) and record relative reconstruction errors in decibels. Each data point on the graph was obtained by averaging the results of 100 trials. We observe that the reconstruction error in every case is approximately equal to the added noise level, confirming the robust recovery properties of the proposed method.
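As a rough illustration of the noise experiment described above, the following Python sketch scales i.i.d. Gaussian noise to a prescribed SNR and reports an error in decibels. The helper names `add_noise_at_snr` and `error_db`, and the power-ratio conventions used for both the SNR and the decibel error, are assumptions made for illustration; the Matlab code at [23] may use different conventions.

```python
import math
import random


def add_noise_at_snr(b, snr_db, rng):
    """Return b plus i.i.d. zero-mean Gaussian noise scaled to the requested SNR (in dB)."""
    noise = [rng.gauss(0.0, 1.0) for _ in b]
    signal_power = sum(v * v for v in b)
    noise_power = sum(w * w for w in noise)
    # Choose the scale so that 10*log10(signal_power / scaled_noise_power) == snr_db.
    scale = math.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [v + scale * w for v, w in zip(b, noise)]


def error_db(reference, estimate):
    """Relative squared l2 error of `estimate` with respect to `reference`, in decibels."""
    num = sum(abs(r - e) ** 2 for r, e in zip(reference, estimate))
    den = sum(abs(r) ** 2 for r in reference)
    return 10.0 * math.log10(num / den)


rng = random.Random(0)            # fixed seed so the sketch is repeatable
b = [1.0, 2.0, 3.0, 4.0]          # stand-in for a vector of magnitude measurements
noisy_b = add_noise_at_snr(b, 20.0, rng)
print(error_db(b, noisy_b))       # close to -20 dB by construction
```

Under these conventions a reconstruction error "approximately equal to the added noise level" would correspond to `error_db(x_true, x_recovered)` being close to the negative of the SNR in dB.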

Next, we demonstrate efficiency by plotting the average runtime and minimum number of measurements necessary for successful reconstruction. For the purposes of this discussion, we classify a reconstruction as successful if the relative $\ell_2$-norm error in the recovered signal is less than $10^{-5}$. We also provide comparisons with Compressive Phase Retrieval via Lifting (CPRL) [30], an existing framework for sparse phase retrieval. Simulations were performed on a laptop computer with an Intel® Core™ i3-3120M processor, 4 GB RAM and Matlab R2014a. We first consider the reconstruction of an $s$-sparse signal ($N = 64$) from perfect (noiseless) measurements. The minimum number of measurements⁶ required for successful reconstruction is plotted in Fig. 2a, while the corresponding runtime, averaged over 100 trials, is plotted in Fig. 2b. Fig. 2a was generated by starting with a small number of measurements, $\tilde{m}$, and incrementing this number to ensure successful reconstruction in at least 95 of the 100 trials. We notice that the PhaseLift+BP formulation requires a small number of additional measurements when compared to CPRL. This is potentially only the case for small values of $s$ since Theorem 3 shows that $O(s \log(N/s))$ measurements suffice for the PhaseLift+BP formulation. Moreover, since the PhaseLift+BP solution is obtained by solving a smaller SDP, the average runtime is significantly smaller (by several orders of magnitude) than CPRL, as shown in Fig. 2b.
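Because recovery is only possible up to a global phase factor, a success criterion of this kind has to compare the estimate with the true signal after aligning their phases. The sketch below shows one way this could be done in Python; the function name `phase_aligned_relative_error` is hypothetical, and the closed-form alignment used here is a standard least-squares fact rather than something taken from the paper's code.

```python
import cmath
import math


def phase_aligned_relative_error(x_true, x_est):
    """min over theta of ||e^{i theta} x_true - x_est||_2, divided by ||x_true||_2."""
    # The minimizing theta is the phase of <x_true, x_est> = sum(conj(x_true) * x_est).
    inner = sum(complex(a).conjugate() * b for a, b in zip(x_true, x_est))
    rotation = cmath.exp(1j * cmath.phase(inner)) if inner != 0 else 1.0
    num = math.sqrt(sum(abs(rotation * a - b) ** 2 for a, b in zip(x_true, x_est)))
    den = math.sqrt(sum(abs(a) ** 2 for a in x_true))
    return num / den


x_true = [1 + 0j, 0j, 2j]
x_est = [cmath.exp(0.7j) * v for v in x_true]   # same signal, rotated by a global phase
err = phase_aligned_relative_error(x_true, x_est)
successful = err < 1e-5                          # threshold quoted in the text
```

Here `x_est` differs from `x_true` only by a global phase, so the aligned error is at the level of floating-point rounding and the trial counts as successful.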

5.  Discussion 

It is interesting to note that the compressive phase retrieval strategy discussed herein also immediately implies the existence of stable sublinear-time compressive phase retrieval algorithms. These can be achieved by combining the phase retrieval technique of one's choice with a $o(N)$-time compressive sensing method (see, e.g., [21]) in order to create a $o(N)$-time compressive phase retrieval algorithm. In addition, we conclude by noting that random combinations of a random set of rows from a Fourier matrix will also exhibit the stable compressive phase retrieval property by Proposition 1/Theorem 3. This is of particular interest due to the special role that Fourier measurements play in many applications.

6 For the PhaseLift+BP implementation, we fixed the compressive sensing problem dimension to be $m = \lceil 1.75 s \log(N/s) \rceil$.

Fig. 2. Runtime performance and minimum number of measurements required: $N = 64$, noiseless measurements.

References 


[1]  B.  Alexeev,  A.S.  Bandeira,  M.  Fickus,  D.G.  Mixon,  Phase  retrieval  with  polarization,  SIAM  J.  Imaging  Sci.  7  (1)  (2014) 
35-66. 

[2]  R.  Balan,  B.G.  Bodmann,  P.G.  Casazza,  D.  Edidin,  Painless  reconstruction  from  magnitudes  of  frame  coefficients,  J.  Fourier 
Anal.  Appl.  15  (4)  (2009)  488-501. 

[3]  R.  Balan,  P.  Casazza,  D.  Edidin,  On  signal  reconstruction  without  phase,  Appl.  Comput.  Harmon.  Anal.  20  (3)  (2006) 
345-356. 

[4]  R.  Baraniuk,  M.  Davenport,  R.  DeVore,  M.  Wakin,  A  simple  proof  of  the  restricted  isometry  property  for  random  matrices, 
Constr.  Approx.  28  (3)  (2008)  253-263. 

[5]  S.  Becker,  E.J.  Candes,  M.  Grant,  Templates  for  convex  cone  problems  with  applications  to  sparse  signal  recovery,  Math. 
Program.  Comput.  3  (3)  (2011)  165-218. 

[6]  S.  Becker,  E.J.  Candes,  M.  Grant,  TFOCS:  templates  for  first-order  conic  solvers,  version  1.3.1,  http://cvxr.com/tfocs, 
2014. 

[7]  T.  Blumensath,  M.E.  Davies,  Iterative  hard  thresholding  for  compressed  sensing,  Appl.  Comput.  Harmon.  Anal.  27  (3) 
(2009)  265-274. 

[8]  E.  Candes,  J.  Romberg,  T.  Tao,  Robust  uncertainty  principles:  exact  signal  reconstruction  from  highly  incomplete  frequency 
information,  IEEE  Trans.  Inform.  Theory  52  (2006)  489-509. 

[9]  E.  Candes,  J.  Romberg,  T.  Tao,  Stable  signal  recovery  from  incomplete  and  inaccurate  measurements,  Comm.  Pure  Appl. 
Math.  59  (8)  (2006)  1207-1223. 

[10]  E.J.  Candes,  T.  Tao,  Near  optimal  signal  recovery  from  random  projections:  universal  encoding  strategies?,  IEEE  Trans. 
Inform.  Theory  52  (12)  (2006)  5406-5425. 

[11]  E.J.  Candes,  X.  Li,  Solving  quadratic  equations  via  PhaseLift  when  there  are  about  as  many  equations  as  unknowns, 
Found.  Comput.  Math.  14  (5)  (2014)  1017-1026. 

[12]  E.J.  Candes,  T.  Strohmer,  V.  Voroninski,  PhaseLift:  exact  and  stable  signal  recovery  from  magnitude  measurements  via 
convex  programming,  Comm.  Pure  Appl.  Math.  66  (8)  (2013)  1241-1274. 

[13] A. Cohen, W. Dahmen, R. DeVore, Compressed sensing and best k-term approximation, J. Amer. Math. Soc. 22 (1) (2009) 211-231.

[14]  D.L.  Donoho,  Compressed  sensing,  IEEE  Trans.  Inform.  Theory  52  (4)  (2006)  1289-1306. 

[15]  Y.C.  Eldar,  S.  Mendelson,  Phase  retrieval:  stability  and  recovery  guarantees,  Appl.  Comput.  Harmon.  Anal.  36  (3)  (2014) 
473-494. 

[16]  M.  Fickus,  D.G.  Mixon,  A. A.  Nelson,  Y.  Wang,  Phase  retrieval  from  very  few  measurements,  Linear  Algebra  Appl.  449 
(2014)  475-499. 

[17]  S.  Foucart,  H.  Rauhut,  A  Mathematical  Introduction  to  Compressive  Sensing,  Springer,  2013. 

[18] M. Grant, S. Boyd, Graph implementations for nonsmooth convex programs, in: V. Blondel, S. Boyd, H. Kimura (Eds.), Recent Advances in Learning and Control, in: Lecture Notes in Control and Information Sciences, Springer-Verlag Limited, 2008, pp. 95-110, http://stanford.edu/~boyd/graph_dcp.html.

[19]  M.  Grant,  S.  Boyd,  CVX:  matlab  software  for  disciplined  convex  programming,  version  2.1,  http://cvxr.com/cvx,  2014. 

[20]  T.  Heinosaari,  L.  Mazzarella,  M.M.  Wolf,  Quantum  tomography  under  prior  information,  Comm.  Math.  Phys.  318  (2) 
(2013)  355-374. 

[21] M. Iwen, Compressed sensing with sparse binary matrices: instance optimal error guarantees in near-optimal time, J. Complexity 30 (1) (2014) 1-15.

[22] M. Iwen, F. Krahmer, A. Viswanathan, Technical note: a minor correction of Theorem 1.3 from [1], http://www.math.msu.edu/~markiwen/Papers/PhaseLiftproof.pdf, 2015.

[23] M. Iwen, Y. Wang, A. Viswanathan, SparsePR: matlab software for sparse phase retrieval, version 1.0, https://bitbucket.org/charms/sparsepr, 2014.

[24]  K.  Jaganathan,  S.  Oymak,  B.  Hassibi,  Sparse  phase  retrieval:  convex  algorithms  and  limitations,  in:  Proceedings  of  the 
2013  IEEE  International  Symposium  on  Information  Theory,  ISIT,  IEEE,  2013,  pp.  1022-1026. 

[25]  S.  Kunis,  H.  Rauhut,  Random  sampling  of  sparse  trigonometric  polynomials  II  -  orthogonal  matching  pursuit  versus  basis 
pursuit,  Found.  Comput.  Math.  8  (6)  (2008)  737-763. 

[26]  X.  Li,  V.  Voroninski,  Sparse  signal  recovery  from  quadratic  measurements  via  convex  programming,  SIAM  J.  Math.  Anal. 
45  (5)  (2013)  3019-3033. 

[27] D. Needell, J.A. Tropp, CoSaMP: iterative signal recovery from incomplete and inaccurate samples, Appl. Comput. Harmon. Anal. 26 (3) (2009) 301-321.

[28]  D.  Needell,  R.  Vershynin,  Uniform  uncertainty  principle  and  signal  recovery  via  regularized  orthogonal  matching  pursuit, 
Found.  Comput.  Math.  9  (2009)  317-334. 

[29] D. Needell, R. Vershynin, Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit, IEEE J. Sel. Top. Signal Process. (2010) 310-316.

[30]  H.  Ohlsson,  A.  Yang,  R.  Dong,  S.  Sastry,  Cprl:  an  extension  of  compressive  sensing  to  the  phase  retrieval  problem,  in: 
Proceedings  of  the  26th  Conference  on  Advances  in  Neural  Information  Processing  Systems,  2012,  pp.  1376-1384. 

[31]  P.  Schniter,  S.  Rangan,  Compressive  phase  retrieval  via  generalized  approximate  message  passing,  in:  Proc.  Allerton  Conf. 
on  Communication,  Control,  and  Computing,  2012. 

[32]  Y.  Shechtman,  A.  Beck,  Y.C.  Eldar,  Gespar:  efficient  phase  retrieval  of  sparse  signals,  IEEE  Trans.  Signal  Process.  62  (4) 
(2014)  928-938. 

[33]  J.  Tropp,  A.  Gilbert,  Signal  recovery  from  partial  information  via  orthogonal  matching  pursuit,  IEEE  Trans.  Inform. 
Theory  53  (12)  (Dec.  2007)  4655-4666. 

[34]  Y.  Wang,  Z.  Xu,  Phase  retrieval  for  sparse  signals,  Appl.  Comput.  Harmon.  Anal.  37  (3)  (2014)  531-544. 

[35] Ç. Yapar, V. Pohl, H. Boche, Fast compressive phase retrieval from Fourier measurements, arXiv:1410.7351, 2014.


Please  cite  this  article  in  press  as:  Appl.  Comput.  Harmon.  Anal. 

(2015), http://dx.doi.org/10.1016/j.acha.2015.06.007


Appl.  Comput.  Harmon.  Anal.  •••  (••••) 


ELSEVIER 


Contents  lists  available  at  ScienceDirect 

Applied  and  Computational  Harmonic  Analysis 

www.elsevier.com/locate/acha


A  multiscale  sub-linear  time  Fourier  algorithm  for  noisy  data 

Andrew Christlieb^{a,1}, David Lawlor^{b,c,*}, Yang Wang^{a,2}

a Department of Mathematics, Michigan State University, East Lansing, MI 48824, United States
b Statistical and Applied Mathematical Sciences Institute, 19 T. W. Alexander Drive, P.O. Box 14006, Research Triangle Park, NC 27709-4006, United States
c Department of Mathematics, Duke University, Box 90320, Durham, NC 27708-0320, United States


ARTICLE  INFO 


ABSTRACT 


Article history:
Received 25 March 2014
Received in revised form 10 October 2014
Accepted 12 April 2015
Available online xxxx
Communicated by Gregory Beylkin


Keywords: 

Fast  Fourier  algorithms 
Multiscale  algorithms 
Fourier  analysis 


We extend the recent sparse Fourier transform algorithm of [1] to the noisy setting, in which a signal of bandwidth $N$ is given as a superposition of $k \ll N$ frequencies and additive random noise. We present two such extensions, the second of which exhibits a form of error-correction in its frequency estimation not unlike that of the β-encoders in analog-to-digital conversion [2]. On $k$-sparse signals corrupted with additive complex Gaussian noise, the algorithm runs in time $O(k \log(k) \log(N/k))$ on average, provided the noise is not overwhelming. The error-correction property allows the algorithm to outperform FFTW [3], a highly optimized software package for computing the full discrete Fourier transform, over a wide range of sparsity and noise values.

©  2015  Elsevier  Inc.  All  rights  reserved. 


1.  Introduction 

The Fast Fourier Transform (FFT) [4] is a fundamental numerical algorithm whose importance in a wide variety of applications cannot be overstated. The FFT reduces the runtime complexity of calculating the discrete Fourier transform (DFT) of a length $N$ array from the naive $O(N^2)$ to $O(N \log(N))$. At the time of its introduction in the mid-1960s, it dramatically increased the size of problems that a typical computer could handle. Over the past fifty years the typical size of data sets has grown by orders of magnitude, and in certain application areas (e.g. cognitive radio and ultra-wideband radar [5,6]) the computation of the full FFT is no longer tractable on commodity hardware. In this and other instances, however, it is known a priori that the signals of interest have small frequency support; that is, their Fourier transforms are sparse. This problem has received attention from a number of research communities over the past decade, who have shown that it is possible to significantly outperform the FFT in both runtime and sampling requirements when the number of significant Fourier modes $k$ is much less than the nominal bandwidth $N$. Early works addressing this topic from the perspective of learning boolean functions include [7,8].

The sparse Fourier transform problem was first studied explicitly in [9,10], the latter of which gave a randomized algorithm with runtime and sampling complexity $O(k^2 \, \mathrm{polylog}(N))$.³ This was later improved to $O(k \, \mathrm{polylog}(N))$ [11] through the use of unequally-spaced FFTs [12]. For a given failure probability $\delta$ and accuracy parameter $\epsilon$, the algorithm returns a $k$-term approximation $y$ to the DFT $\hat{x}$ of the input $x$ such that with probability $1 - \delta$ it holds that

$$ \|\hat{x} - y\|_2^2 \le (1 + \epsilon) \|\hat{x} - \hat{x}_k\|_2^2. \tag{1} $$

Here $\hat{x}_k$ is the best $k$-term approximation to $\hat{x}$ and $\|\cdot\|_2$ is the discrete $\ell_2$ norm. In [13], a randomized $O(k^2 \, \mathrm{polylog}(N))$ algorithm for the sparse Fourier transform problem was given in the context of list decoding.⁴ A separate group of authors [14] has developed a modified version of the algorithm of [11] with runtime $O(\log(N) \sqrt{N k \log(N)})$. While the dependence on $N$ is sub-optimal asymptotically, in practice this algorithm is significantly faster than either [10] or [11]. The same authors presented an improved algorithm with runtime $O(k \log(N) \log(N/k))$ in [15] whose frequency identification procedure is very similar to [1], upon which the present work is based. However, the performance of [15] in the presence of noise has yet to be evaluated empirically.

The algorithms described in the previous paragraph are all randomized, and so will fail on each signal with positive probability. Recognizing this as a potential detriment in failure-intolerant applications, two authors have independently given deterministic algorithms for the sparse Fourier transform problem. In [16,17] an algorithm with $\mathrm{poly}(k, \log(N))$ runtime was given where the exponent on $k$ is at least six. This high dependence on $k$ renders the algorithm infeasible in practice, and it has not been implemented. However, we note that the algorithms of [18,16,17] address a strictly wider class of signals than those with $k$-sparse Fourier spectrum, specifically those satisfying $\|\hat{S}\|_1 / \|\hat{S}\|_2 < \mathrm{polylog}(N)$. In [19], the combinatorial properties of aliasing among frequencies were exploited to give an algorithm with runtime and sampling complexity $O(k^2 \, \mathrm{polylog}(N))$. While this represented a major improvement over the theoretical runtime complexity of [16], in practice it only outperformed the FFT for relatively modest values of the sparsity $k$.

Most recently the authors of [1] gave a deterministic algorithm whose sampling and runtime complexity are $O(k \log(k))$ in the average case and $O(k^2 \log(k))$ in the worst case. The worst-case bounds are asymptotically of the same order in $k$ (up to log factors) as [19], but over a representative class of random signals it was shown to significantly outperform its deterministic and randomized competitors. This was achieved by sampling the input at two sets of equispaced points slightly offset in time. This time shift appears in the Fourier domain as a frequency modulation, which allows the authors to both detect when aliasing has occurred and, for frequencies that are isolated (i.e. not aliased), to calculate the frequency value directly. While [19] also uses properties of aliasing to reconstruct frequency values, it is not able to distinguish between aliased and non-aliased terms until sufficiently many DFTs of coprime lengths have been computed, and so is unable to perform any better in the average case than in the worst case. In the empirical evaluation of [1] an improvement of over two orders of magnitude was observed over [11] and [19].

In this paper we extend the algorithm of [1] to noisy environments in two distinct ways. The first of these, which is a minor modification of the noiseless algorithm, is based on a certain rounding of the frequency estimates and was previously reported in [1]. In this work we provide an improved algorithm and more detailed analysis of that earlier work. The second extension is the main result of this paper (summarized in Algorithm 1 and Theorem 4.5), a multiscale error-correcting algorithm that utilizes offset time samples at geometrically spaced time shifts. This extension is in essence a progressive frequency identification algorithm not unlike the β-encoders for analog-to-digital conversion [2]. While prior works have utilized multiscale frequency identification procedures [7-9,13,18,16,17], the connection to β-encoders is to the best of our knowledge novel. The new algorithm gives excellent performance in the noisy setting without significantly increasing the computational costs from the noiseless case. For both extensions we provide detailed mathematical analysis as well as empirical evaluations. While both extensions work well in the noisy environment, the multiscale algorithm achieves comparable accuracy at a significantly lower computational cost.

3 We write $f = \mathrm{polylog}(g)$ to indicate that $f = O(\log^c(g))$ for some unspecified constant $c$.

4 The runtime of this algorithm was incorrectly stated as $O(k^{11/2} \, \mathrm{polylog}(N))$ in [1].

5 This algorithm is a de-randomization of the randomized algorithm presented in [18].

It should be emphasized that our algorithm assumes access to an underlying continuous signal $S(t)$, $t \in [0,1]$, rather than a discrete set of equidistant samples of $S$, which is the setting for the previous scholarship mentioned above. Indeed, this assumption is critical for our multiscale algorithm, as it allows the separation of nearby modes by sampling at finer scales. While this makes comparisons with other algorithms less straightforward, the assumption is valid in several application domains, including ultra-wideband radar [6]. It should also be noted that, as we discuss briefly in Section 5.4, a trivial modification of our multiscale algorithm is able to recover a single non-integral frequency in random noise. Other works addressing this problem assume a minimum separation between modes [20,21], which is the effect achieved by our continuous signal assumption.

The  remainder  of  this  paper  is  organized  as  follows.  In  Section  2  we  review  the  notation  introduced  in 
[1]  that  will  be  necessary  in  the  sequel.  We  also  describe  our  noise  model,  discuss  some  of  the  problems 
noisy signals present for the algorithm of [1], and argue that in certain applications the $\ell_2$ error metric is
inappropriate and should be replaced with a form of Earth Mover's Distance. We also describe the random
signal  model  used  in  the  empirical  evaluations  in  Section  5.  In  Section  3  we  give  our  first  modified  algorithm 
and  analyze  the  dependence  of  the  sampling  rate  on  the  noise  level.  In  Section  4  we  describe  our  multiscale 
frequency  identification  procedure,  and  in  Section  5  we  provide  an  empirical  evaluation  of  the  accuracy  and 
speed  of  both  algorithms.  Finally  in  Section  6  we  provide  a  brief  conclusion. 

2.  Preliminaries 

2.1.  Notation  and  brief  review 

In this section we introduce the notation that will be used in the remainder of this paper and briefly review the results in [1]. We denote by $\mathbb{Z}$ the set of integers, $\mathbb{C}$ the set of complex numbers, and we let $N$ be a fixed (large) natural number. We write $\lfloor x \rfloor$ to denote the largest integer less than or equal to $x$. All logarithms are in base two unless explicitly specified.

We consider frequency-sparse band-limited signals $S : [0,1) \to \mathbb{C}$ of the form

$$ S(t) = \sum_{\omega \in \Omega} a_\omega e^{2\pi i \omega t}, \tag{2} $$

where $\Omega$ is a finite set of integers bounded in $[-N/2, N/2)$ and $0 \ne a_\omega \in \mathbb{C}$ for each $\omega \in \Omega$. Denote by $a_{\min} = \min\{|a_\omega| : \omega \in \Omega\}$. For simplicity we shall extend $S(t)$ periodically to a function on the whole real line. The Fourier samples of $S$ are given by

$$ \hat{S}(h) = \int_0^1 S(t) e^{-2\pi i h t} \, dt, \quad h \in \mathbb{Z}, \tag{3} $$

so that for signals of the form (2) we have $\hat{S}(\omega) = a_\omega$ for $\omega \in \Omega$ and $\hat{S}(h) = 0$ for all other $h \in \mathbb{Z}$.

In practice we work with data of finite length. Given any finite sequence $s = (s_0, s_1, \dots, s_{p-1})$ of length $p$ its DFT is given by

$$ \hat{s}[h] = \sum_{j=0}^{p-1} s_j e^{-2\pi i j h / p} = \sum_{j=0}^{p-1} s[j] W_p^{jh}, \tag{4} $$

where $h = 0, 1, \dots, p-1$, $s[j] := s_j$ and $W_p := e^{-2\pi i / p}$ is the primitive $p$-th root of unity. The FFT [4] allows the computation of $\hat{s}$ in $O(p \log p)$ steps.
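For concreteness, definition (4) can be evaluated directly as in the Python sketch below. This is the naive $O(p^2)$ evaluation, shown only to pin down the sign and indexing conventions; it is not the FFT, and the function name `dft` is simply a label chosen here.

```python
import cmath


def dft(s):
    """Direct evaluation of (4): s_hat[h] = sum_j s[j] * W_p^(j*h), W_p = exp(-2*pi*i/p)."""
    p = len(s)
    # Reducing j*h modulo p keeps the argument of exp small and does not change W_p^(j*h).
    return [
        sum(s[j] * cmath.exp(-2j * cmath.pi * ((j * h) % p) / p) for j in range(p))
        for h in range(p)
    ]


impulse_hat = dft([1, 0, 0, 0])    # a unit impulse has a flat spectrum
constant_hat = dft([1, 1, 1, 1])   # a constant sequence has all its energy at h = 0
```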

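To obtain a fast reconstruction algorithm we apply the DFT to selected finite sample sets of $S(t)$. Let $p$ be a positive integer and $\epsilon > 0$. The two sample sets we use extensively are $S_p$ and $S_{p,\epsilon}$, which are length $p$ samples of $S(t)$ given by

$$ S_p[j] = S\left(\frac{j}{p}\right), \quad S_{p,\epsilon}[j] = S\left(\frac{j}{p} + \epsilon\right), \quad j = 0, 1, \dots, p-1. \tag{5} $$

For each $h$ let $A_{p,h} = \{\omega \in \Omega : \omega \equiv h \pmod{p}\}$, where $\omega \equiv h \pmod{p}$ indicates that $\omega - h$ is divisible by $p$. It is a simple derivation to obtain

$$ \hat{S}_p[h] = p \sum_{\omega \in A_{p,h}} a_\omega, \quad \hat{S}_{p,\epsilon}[h] = p \sum_{\omega \in A_{p,h}} a_\omega e^{2\pi i \epsilon \omega}. \tag{6} $$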
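The aliasing identity (6) is easy to check numerically. The Python sketch below samples a small test signal as in (5), takes the two DFTs, and compares each bin with the right-hand side of (6). The test signal, the values of $p$ and $\epsilon$, and the helper names (`dft`, `sample`, `predicted`) are all choices made here for illustration.

```python
import cmath


def dft(s):
    """Naive DFT with the convention of (4)."""
    p = len(s)
    return [
        sum(s[j] * cmath.exp(-2j * cmath.pi * ((j * h) % p) / p) for j in range(p))
        for h in range(p)
    ]


def sample(modes, p, eps=0.0):
    """Samples S(j/p + eps), j = 0..p-1, of S(t) = sum_w a_w exp(2*pi*i*w*t)."""
    return [
        sum(a * cmath.exp(2j * cmath.pi * w * (j / p + eps)) for w, a in modes.items())
        for j in range(p)
    ]


def predicted(modes, p, h, eps):
    """Right-hand side of (6): p times the sum over frequencies congruent to h mod p."""
    return p * sum(
        a * cmath.exp(2j * cmath.pi * eps * w) for w, a in modes.items() if (w - h) % p == 0
    )


modes = {3: 1.0, 10: 0.5j, -2: 2.0}   # frequencies 3 and 10 share the residue 3 mod 7
p, eps = 7, 1.0 / 32
S_p_hat = dft(sample(modes, p))
S_pe_hat = dft(sample(modes, p, eps))
max_gap = max(
    max(abs(S_p_hat[h] - predicted(modes, p, h, 0.0)) for h in range(p)),
    max(abs(S_pe_hat[h] - predicted(modes, p, h, eps)) for h in range(p)),
)
```

In this example bin $h = 3$ collects both $\omega = 3$ and $\omega = 10$, while bin $h = 5$ holds only $\omega = -2$; `max_gap` stays at rounding level in both cases.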

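Let $\omega \pmod{p}$ indicate the remainder after division of $\omega$ by $p$. In the ideal scenario where all $\{\omega \pmod{p} : \omega \in \Omega\}$ are distinct we have

$$ \hat{S}_p[h] = \begin{cases} p a_\omega & h \equiv \omega \pmod{p} \text{ for some } \omega \in \Omega, \\ 0 & \text{otherwise,} \end{cases} \tag{7} $$

and similarly

$$ \hat{S}_{p,\epsilon}[h] = \begin{cases} p a_\omega e^{2\pi i \epsilon \omega} & h \equiv \omega \pmod{p} \text{ for some } \omega \in \Omega, \\ 0 & \text{otherwise.} \end{cases} \tag{8} $$

Thus, the nonzero elements of $\hat{S}_p[h]$ occur precisely at the locations $h \equiv \omega \pmod{p}$ for some $\omega \in \Omega$, and moreover for such $h$ we have $|\hat{S}_p[h]| = |\hat{S}_{p,\epsilon}[h]|$. Furthermore for each $\omega \in \Omega$ and $h \equiv \omega \pmod{p}$ we have

$$ \frac{\hat{S}_{p,\epsilon}[h]}{\hat{S}_p[h]} = e^{2\pi i \epsilon \omega}. $$

Hence

$$ 2\pi \epsilon \omega \equiv \operatorname{Arg}\left( \frac{\hat{S}_{p,\epsilon}[h]}{\hat{S}_p[h]} \right) \pmod{2\pi}, \tag{9} $$

where $\operatorname{Arg}(z)$ denotes the phase angle of the complex number $z$ in $[-\pi, \pi)$. Now assume that we have $|\epsilon| \le \frac{1}{N}$. Then $\omega$ is completely determined by (9), as there will be no wrap-around aliasing. Hence

$$ \omega = \frac{1}{2\pi\epsilon} \operatorname{Arg}\left( \frac{\hat{S}_{p,\epsilon}[h]}{\hat{S}_p[h]} \right). \tag{10} $$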

The weight $a_\omega$ can be recovered via $a_\omega = \hat{S}_p[h]/p$. In fact, more generally, if we have an estimate of $\omega \in \Omega$, say $|\omega| < \frac{L}{2}$, then by taking $\epsilon \le \frac{1}{L}$ the same reconstruction formula (10) holds. We will use this observation
in Section 4 when we develop a multiscale frequency identification procedure for noisy signals.
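The reconstruction formula (10), together with the amplitude formula $a_\omega = \hat{S}_p[h]/p$, can be sketched in a few lines of Python for the collision-free, noiseless case. The function name `recover_isolated_modes`, the tolerance, and the final rounding to an integer (used here only to absorb floating-point error) are assumptions of this sketch rather than details taken from [1].

```python
import cmath


def dft(s):
    """Naive DFT with the convention of (4)."""
    p = len(s)
    return [
        sum(s[j] * cmath.exp(-2j * cmath.pi * ((j * h) % p) / p) for j in range(p))
        for h in range(p)
    ]


def sample(modes, p, eps=0.0):
    """Samples S(j/p + eps), j = 0..p-1, of S(t) = sum_w a_w exp(2*pi*i*w*t)."""
    return [
        sum(a * cmath.exp(2j * cmath.pi * w * (j / p + eps)) for w, a in modes.items())
        for j in range(p)
    ]


def recover_isolated_modes(modes, N, p, tol=1e-9):
    """Apply (10) in every nonzero bin, assuming no two frequencies collide modulo p."""
    eps = 1.0 / N                       # small enough shift: no wrap-around in the phase
    S_p_hat = dft(sample(modes, p))
    S_pe_hat = dft(sample(modes, p, eps))
    found = {}
    for h in range(p):
        if abs(S_p_hat[h]) <= tol * p:  # empty bin
            continue
        ratio = S_pe_hat[h] / S_p_hat[h]
        omega = round(cmath.phase(ratio) / (2 * cmath.pi * eps))   # formula (10)
        found[omega] = S_p_hat[h] / p                              # amplitude a_omega
    return found


true_modes = {-17: 2.0, 6: 1j}          # residues 5 and 6 modulo 11, so no collision
recovered = recover_isolated_modes(true_modes, N=64, p=11)
```

Only $2p = 22$ samples are used here even though the bandwidth is $N = 64$, which is the point of the construction.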
Of course it is possible that not all $\{\omega \pmod{p} : \omega \in \Omega\}$ are distinct. For an $\omega \in \Omega$ we say $\omega$ has a collision modulo $p$, or simply has a collision when there is no ambiguity in the modulus $p$, if there is at least one other $\omega' \in \Omega$ such that $\omega \equiv \omega' \pmod{p}$. In [1] a criterion is developed to detect collisions in the noiseless case. For $\omega \in \Omega$ and $h \equiv \omega \pmod{p}$, it is clear that a necessary condition for no collision to occur is

$$ \left| \frac{\hat{S}_{p,\epsilon}[h]}{\hat{S}_p[h]} \right| = \left| e^{2\pi i \epsilon \omega} \right| = 1. \tag{11} $$

It is shown in [1] that for a randomly chosen $\epsilon > 0$ the converse holds with probability one, and furthermore checking the condition (11) for several $\epsilon$ would be sufficient to deterministically decide whether $\omega$ has a collision. In Section 4 we use this latter observation to devise a robust test for collisions even in the presence of noise.
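A minimal Python sketch of the noiseless collision test based on (11) follows. The function name `collision_suspected`, the tolerance, and the test signal are illustrative assumptions; a single value of $\epsilon$ is checked, so a passing bin is only "not shown to collide", in line with (11) being a necessary condition.

```python
import cmath


def dft(s):
    """Naive DFT with the convention of (4)."""
    p = len(s)
    return [
        sum(s[j] * cmath.exp(-2j * cmath.pi * ((j * h) % p) / p) for j in range(p))
        for h in range(p)
    ]


def sample(modes, p, eps=0.0):
    """Samples S(j/p + eps), j = 0..p-1, of S(t) = sum_w a_w exp(2*pi*i*w*t)."""
    return [
        sum(a * cmath.exp(2j * cmath.pi * w * (j / p + eps)) for w, a in modes.items())
        for j in range(p)
    ]


def collision_suspected(S_p_h, S_pe_h, tol=1e-6):
    """True when |S_pe_h / S_p_h| deviates from 1, i.e. condition (11) fails."""
    return abs(abs(S_pe_h / S_p_h) - 1.0) > tol


modes = {3: 1.0, 10: 0.5j, 5: 2.0}     # 3 and 10 collide modulo 7; 5 is isolated
p, eps = 7, 1.0 / 32
S_p_hat = dft(sample(modes, p))
S_pe_hat = dft(sample(modes, p, eps))
bin3_flagged = collision_suspected(S_p_hat[3], S_pe_hat[3])
bin5_flagged = collision_suspected(S_p_hat[5], S_pe_hat[5])
```

For this signal the colliding bin $h = 3$ has a ratio of magnitude roughly 0.46 and is flagged, while the isolated bin $h = 5$ has a ratio of unit magnitude and is not.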

The algorithm developed in [1] for recovering $S(t)$ is as follows: First we pick a prime $p = p_1$, which is roughly $5k$ where $k = |\Omega|$ is the number of modes in $S(t)$ ($k$ is commonly referred to as the sparsity of $S(t)$). By taking $p > 5k$ we ensure that on average collisions do not occur for more than 90% of $\omega \in \Omega$. Let $\Omega'$ denote the subset of $\Omega$ consisting of all non-collision $\omega \in \Omega$. For each $\omega \in \Omega'$ we recover $a_\omega e^{2\pi i \omega t}$, and update $S(t)$ to

$$ S_1(t) = S(t) - \sum_{\omega \in \Omega'} a_\omega e^{2\pi i \omega t}. \tag{12} $$

We now apply the above procedure again for $S_1(t)$ with a different prime $p = p_2$ approximately in the range of $5k_1$, where $k_1 = k - |\Omega'|$ is now the sparsity for $S_1(t)$. This process is repeated until all modes are found.

In the implementation of the algorithm we set a small threshold in (11) to check for collisions. This means there is a small probability that a collision is undetected by our criterion and a false value $\omega_0$ is put into $\Omega'$ when it shouldn't be. In subsequent iterations, this will create a new mode $-c_0 e^{2\pi i \omega_0 t}$ for some $c_0 \in \mathbb{C}$ in $S_1(t)$. By the use of different primes $p_j$ in each iteration this false mode will be identified and subtracted from the final reconstruction. In Section 4.3 we provide an improved aliasing test for our multiscale algorithm which makes the inclusion of spurious frequencies even less likely. However, it is still possible that incorrect modes are inserted before being deleted in the high-noise regime, as we discuss in Section 5.3.

2.2.  Noise  model 

In  a  number  of  potential  application  areas  for  sparse  Fourier  algorithms,  the  samples  collected  will  be 
corrupted  by  noise.  One  example  of  sparse  Fourier  transforms  being  used  on  real  data  is  given  in  [22],  where 
an  application  to  faster  GPS  location  is  presented.  Several  previous  works  have  considered  the  sparse  Fourier 
transform problem for noisy signals, including the case of adversarial noise in [7,13].⁶ Random noise models
have  been  considered  in  [15,18,16,17],  although  the  algorithms  presented  in  those  works  for  noisy  signals 
have  yet  to  be  implemented  and  evaluated  empirically. 

6 The algorithm of [10] implicitly addresses this challenging setting as well.

In this paper we assume an i.i.d. noise model

$$ S_p^n[j] = S\left(\frac{j}{p}\right) + n_j = S_p[j] + n_j, \tag{13} $$

where $n = (n_j)$ are i.i.d. complex random variables with mean 0 and variance $\sigma^2$. A typical model is to assume $\{n_j\}$ are i.i.d. complex Gaussian. With this noise model we have

$$ \hat{S}_p^n[h] = \hat{S}_p[h] + \hat{n}[h], \tag{14} $$


where 


p- 1 

n[h]  =  J2^~2nihj/P- 
3=0 

By  the  i.i.d.  property  for  {re,  }  we  have  for  each  h 


E[n[h]]  =  0 


(15) 


(16) 


and 


Var[n[/i]]  =  pa2 , 

where  the  expectations  are  taken  with  respect  to  the  randomness  in  the  noise.  This  yields 


E 


s  m 


=  sp[h] 


(17) 


(18) 


and 


E 


m  -  sPMi5 


=  pa 


(19) 


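Thus, a typical noisy DFT coefficient $\hat S^n_p[h]$ will deviate from the true value $\hat S_p[h]$ by an amount proportional
to $\sigma\sqrt{p}$. Similarly, for $S^n_{p,\varepsilon} = S_{p,\varepsilon} + n_\varepsilon$ we will have

$$E\bigl[\hat S^n_{p,\varepsilon}[h]\bigr] = \hat S_{p,\varepsilon}[h] \quad\text{and}\quad E\Bigl[\bigl|\hat S^n_{p,\varepsilon}[h] - \hat S_{p,\varepsilon}[h]\bigr|^2\Bigr] = p\sigma^2. \tag{20}$$

We now pick a non-collision $\omega \in \Omega$. Then for $h \equiv \omega \pmod p$ we will have

$$\hat S^n_p[h] = p\,a_\omega + O(\sigma\sqrt{p}), \tag{21}$$

$$\hat S^n_{p,\varepsilon}[h] = p\,a_\omega e^{2\pi i \omega\varepsilon} + O(\sigma\sqrt{p}). \tag{22}$$

As a result $a_\omega$ can now be estimated easily via

$$a_\omega = \frac{1}{p}\hat S^n_p[h] + O\!\left(\frac{\sigma}{\sqrt{p}}\right). \tag{23}$$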
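The variance identity (17), which drives the error terms in (21)-(23), can be checked numerically. The following is a minimal sketch, assuming circularly symmetric complex Gaussian noise with $E|n_j|^2 = \sigma^2$; the function names `dft_bin` and `empirical_noise_power` and all parameter values are illustrative and do not come from the paper.

```python
import cmath
import random


def dft_bin(samples, h):
    """Return the h-th DFT coefficient of a list of complex samples."""
    p = len(samples)
    return sum(x * cmath.exp(-2j * cmath.pi * h * j / p)
               for j, x in enumerate(samples))


def empirical_noise_power(p, sigma, h, trials, seed=0):
    """Average |n_hat[h]|^2 over independent draws of complex Gaussian noise."""
    rng = random.Random(seed)
    s = sigma / 2 ** 0.5  # per-component std, so that E|n_j|^2 = sigma^2
    total = 0.0
    for _ in range(trials):
        noise = [complex(rng.gauss(0, s), rng.gauss(0, s)) for _ in range(p)]
        total += abs(dft_bin(noise, h)) ** 2
    return total / trials


p, sigma = 32, 0.5
power = empirical_noise_power(p, sigma, h=3, trials=2000)
print(power, p * sigma ** 2)  # the two numbers should be close
```

Since the mean of $\hat n[h]$ is zero, the average of $|\hat n[h]|^2$ estimates its variance, which should be close to $p\sigma^2$.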


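The real challenge lies in the recovery of the frequencies in $\Omega$. Assume that $|\hat S_{p,\varepsilon}|$ has a peak at $h$. Then
$h \equiv \omega \pmod p$ for some $\omega \in \Omega$. If there is no collision for $\omega$, in the noiseless environment $\omega$ is recovered
via (10) as long as $\varepsilon \le \frac{1}{N}$. In the noisy setting $\hat S_{p,\varepsilon}[h]/\hat S_p[h]$ must be replaced by $\hat S^n_{p,\varepsilon}[h]/\hat S^n_p[h]$. Interestingly,
the mean of $\hat S^n_{p,\varepsilon}[h]/\hat S^n_p[h]$ is in general not $\hat S_{p,\varepsilon}[h]/\hat S_p[h]$ as a result of the division. Nevertheless we have

$$\frac{\hat S^n_{p,\varepsilon}[h]}{\hat S^n_p[h]}
= \frac{\hat S_p[h]\,e^{2\pi i\omega\varepsilon} + \hat n_\varepsilon[h]}{\hat S_p[h] + \hat n[h]}
= \frac{\hat S_p[h]\,e^{2\pi i\omega\varepsilon} + O(\sigma\sqrt{p})}{\hat S_p[h] + O(\sigma\sqrt{p})}
= \frac{e^{2\pi i\omega\varepsilon} + O\!\left(\frac{\sigma}{|a_\omega|\sqrt{p}}\right)}{1 + O\!\left(\frac{\sigma}{|a_\omega|\sqrt{p}}\right)}
= e^{2\pi i\omega\varepsilon} + O\!\left(\frac{\sigma}{|a_\omega|\sqrt{p}}\right). \tag{24}$$

Thus the ratio of noisy DFT coefficients agrees with the noiseless ratio up to an error term on the order of
$\frac{\sigma}{|a_\omega|\sqrt{p}}$.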

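Given this estimate for the ratio of noisy DFT coefficients, we can derive bounds for the error in the
Lee norm for the phase angle computed via $\mathrm{Arg}(z)$. Let $\mathcal{L}$ be a lattice in $\mathbb{R}$. For any $\theta \in \mathbb{R}$ the Lee norm
associated with the lattice $\mathcal{L}$ for $\theta$ is given by the distance of $\theta$ to the lattice $\mathcal{L}$, i.e. $\|\theta\|_{\mathcal{L}} := \min_{k\in\mathcal{L}} |\theta - k|$.
Under the Lee norm associated with the lattice $2\pi\mathbb{Z}$ it is well known that for $z, \eta \in \mathbb{C}$ with $|\eta| < |z|$,

$$\|\mathrm{Arg}(z + \eta) - \mathrm{Arg}(z)\|_{2\pi\mathbb{Z}} = \|\mathrm{Arg}(1 + z^{-1}\eta)\|_{2\pi\mathbb{Z}} \le \frac{\pi}{2}\,\frac{|\eta|}{|z|}. \tag{25}$$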

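Thus for a non-collision $\omega \in \Omega$ and $h \equiv \omega \pmod p$, the estimates (25) and (24) combined yield

$$\left\|\mathrm{Arg}\!\left(\frac{\hat S^n_{p,\varepsilon}[h]}{\hat S^n_p[h]}\right) - 2\pi\omega\varepsilon\right\|_{2\pi\mathbb{Z}} \le O\!\left(\frac{\sigma}{|a_\omega|\sqrt{p}}\right). \tag{26}$$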


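When we apply the estimate (10) for $\omega$ under the noise model we will end up with an approximation

$$\omega^n := \frac{1}{2\pi\varepsilon}\,\mathrm{Arg}\!\left(\frac{\hat S^n_{p,\varepsilon}[h]}{\hat S^n_p[h]}\right) \tag{27}$$

such that

$$\|\omega^n - \omega\|_{\mathbb{Z}} \le O\!\left(\frac{\sigma}{2\pi\varepsilon|a_\omega|\sqrt{p}}\right). \tag{28}$$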
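The estimate (27) can be illustrated on a noiseless one-term signal, where the ratio of shifted to unshifted DFT coefficients is exactly $e^{2\pi i\omega\varepsilon}$. This is a minimal sketch: the helper names `dft_bin` and `estimate_frequency`, and the small values of $N$, $p$ and $\omega$, are assumptions chosen for illustration only.

```python
import cmath


def dft_bin(samples, h):
    """Return the h-th DFT coefficient of a list of complex samples."""
    p = len(samples)
    return sum(x * cmath.exp(-2j * cmath.pi * h * j / p)
               for j, x in enumerate(samples))


def estimate_frequency(signal, p, h, eps):
    """Estimate omega from the phase of the ratio of shifted to unshifted DFT bins."""
    plain = [signal(j / p) for j in range(p)]
    shifted = [signal(j / p + eps) for j in range(p)]
    ratio = dft_bin(shifted, h) / dft_bin(plain, h)
    return cmath.phase(ratio) / (2 * cmath.pi * eps)


N = 2 ** 10        # bandwidth (illustrative value)
omega = -317       # true frequency in [-N/2, N/2)
p = 11             # sample length
eps = 1 / (2 * N)  # shift


def signal(t):
    """One-term trigonometric polynomial with unit coefficient."""
    return cmath.exp(2j * cmath.pi * omega * t)


h = omega % p      # residue at which the DFT of the samples peaks
omega_est = estimate_frequency(signal, p, h, eps)
print(omega_est)   # close to omega when there is no noise
```

Adding noise to the samples perturbs the ratio, and hence the phase, by an amount governed by (26); dividing by $2\pi\varepsilon$ amplifies that perturbation, which is why small shifts are sensitive to noise.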


Now if we apply the algorithm developed in [1] the ratio $\frac{\sigma}{a_{\min}\sqrt{p}}$ is critical in determining the sensitivity of
our phase estimation (as well as the weight estimation) to noise. Without any modifications to the algorithm
it is thus important that we choose the lengths $p$ so that

$$\frac{\sigma}{\varepsilon\,a_{\min}\sqrt{p}}$$

is within the tolerance.


2.3.  Earth  mover  distance 


In the existing literature on sparse Fourier transforms, the $\ell^2$ norm is most often used to assess the quality
of approximation. There are many reasons for this choice, with the two most convincing perhaps being the
completeness of the complex exponentials with respect to the $\ell^2$ norm and Parseval's theorem. For certain
applications, however, this choice of norm is inappropriate. For example, in wide-band spectral estimation
and radar applications, one is interested in identifying a set of frequency intervals containing active Fourier
modes. In this case, an estimate $\tilde\omega$ of the true frequency $\omega$ with $|\omega - \tilde\omega| \ll N$ is useful, but unless $\tilde\omega = \omega$
the $\ell^2$ metric will report an error of size $O(a_{\min})$.

For these reasons, we propose measuring the approximation error of sparse Fourier transform problems
with the Earth Mover Distance (EMD) [23]. Originally developed in the context of content-based image
retrieval, EMD measures the minimum cost that must be paid (with a user-specified cost function) to
transform one distribution of points into another. EMD can be calculated efficiently as the solution of a linear
program corresponding to a certain flow minimization problem. In addition to allowing for misidentified
frequencies, this choice of error metric also has the flexibility to measure the quality of approximation for
signals with non-integer frequencies. This important problem has recently been considered in [20], and while
not the primary focus of this manuscript, in Section 5.4 we consider a modification of our proposed algorithm
which allows for the identification of a single non-integer frequency in the presence of noise.

For our problem, we consider the cost to move a set of estimated Fourier modes and coefficients
$\{(\tilde\omega_j, a_{\tilde\omega_j})\}_{j=1}^{k}$ to the true values $\{(\omega_j, a_{\omega_j})\}_{j=1}^{k}$ under the cost function

$$d_1\bigl((\omega, a_\omega), (\tilde\omega, a_{\tilde\omega}); N\bigr) := \frac{|\omega - \tilde\omega|}{N} + |a_\omega - a_{\tilde\omega}|. \tag{29}$$

This choice of cost function strikes a balance between the fidelity of the frequency estimate (as a fraction
of the bandwidth) and that of the coefficient estimate. We also consider the "phase-only" cost function

$$d_\omega(\omega_1, \omega_2; N) := \frac{|\omega_1 - \omega_2|}{N}, \tag{30}$$

which provides a measure of how close our frequency estimates are to the true values. We denote the EMD
using $d_1$ by EMD(1) and using $d_\omega$ by EMD($\omega$) in our empirical studies in Section 5 below.
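To make the two cost functions concrete, the sketch below evaluates them on a tiny example. It rests on an assumption that is not spelled out in the text: every mode is treated as one unit of mass and the two sets have the same size, in which case the optimal flow is a one-to-one matching and the EMD can be found by brute force over permutations. The function names and the choice to report the total (unnormalized) cost are illustrative, and a linear-programming solver would be needed for larger $k$.

```python
from itertools import permutations


def d1(est, true, N):
    """Cost (29): frequency error as a fraction of N plus coefficient error."""
    (w_est, a_est), (w_true, a_true) = est, true
    return abs(w_true - w_est) / N + abs(a_true - a_est)


def d_omega(w1, w2, N):
    """Phase-only cost (30)."""
    return abs(w1 - w2) / N


def emd_equal_weights(estimated, true, cost):
    """Minimum total cost over all one-to-one matchings (brute force, small k only)."""
    best = float("inf")
    for perm in permutations(range(len(true))):
        total = sum(cost(e, true[i]) for e, i in zip(estimated, perm))
        best = min(best, total)
    return best


N = 1000
true = [(200, 1j), (100, 1 + 0j)]
estimated = [(100, 1 + 0j), (205, 1j)]  # one frequency is off by 5

emd1 = emd_equal_weights(estimated, true, lambda e, t: d1(e, t, N))
emdw = emd_equal_weights(estimated, true, lambda e, t: d_omega(e[0], t[0], N))
print(emd1, emdw)
```

In this example the misidentified frequency contributes only $5/N$ to either metric, whereas an $\ell^2$ comparison of the coefficient vectors would charge the full magnitude of the displaced coefficient.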

Since these error metrics may be unfamiliar to the reader, we note here that the theoretical best possible
EMD(1) error is easy to compute in the special case when the EMD($\omega$) error is zero (i.e., all frequencies
are estimated correctly). In this case, we can combine (23) with (29) above to yield

$$\mathrm{EMD}(1) = O\!\left(\frac{k\sigma}{a_{\min}\sqrt{p}}\right). \tag{31}$$

Note in particular that since we measure distances in $\ell^1$ the error scales with $k$, rather than $\sqrt{k}$ as would
be the case in $\ell^2$. The case when EMD($\omega$) is non-zero is much more difficult to analyze and is an important
question that merits considerable attention. We plan to conduct such a study in future work.

2.4. Random signal model

For the average-case analysis in Section 4.3.3 and the empirical evaluations in Section 5 we consider signals
with uniformly random phase over the bandwidth and coefficients chosen uniformly from the complex unit
circle. In other words, given $k$ and $N$, we choose $k$ frequencies $\omega_j$ uniformly at random (without replacement)
from $[-N/2, N/2) \cap \mathbb{Z}$. The corresponding Fourier coefficients $a_j$ are of the form $e^{2\pi i\theta_j}$, where $\theta_j$ is drawn
uniformly from $[0, 1)$. The signal is then given by

$$S(t) = \sum_{j=1}^{k} a_j e^{2\pi i \omega_j t}. \tag{32}$$
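A test signal of the form (32) can be generated as follows. This is a minimal sketch; the function name `random_sparse_signal`, the seed, and the chosen values of $k$ and $N$ are illustrative assumptions, and the authors' own experiments were written in C++.

```python
import cmath
import random


def random_sparse_signal(k, N, seed=None):
    """Draw k distinct integer frequencies and unit-modulus coefficients, as in (32)."""
    rng = random.Random(seed)
    # Frequencies: uniform without replacement from the integers in [-N/2, N/2).
    freqs = rng.sample(range(-N // 2, N // 2), k)
    # Coefficients: e^{2 pi i theta} with theta uniform in [0, 1).
    coeffs = [cmath.exp(2j * cmath.pi * rng.random()) for _ in range(k)]

    def S(t):
        return sum(a * cmath.exp(2j * cmath.pi * w * t)
                   for a, w in zip(coeffs, freqs))

    return freqs, coeffs, S


freqs, coeffs, S = random_sparse_signal(k=8, N=2 ** 22, seed=1)
print(sorted(freqs))
print(abs(S(0.0) - sum(coeffs)))  # at t = 0 the signal is the sum of the coefficients
```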


This is the standard signal model considered in previous empirical evaluations of sub-linear Fourier
algorithms [24,19,14,1]. We note here that we also conducted the empirical evaluations of Section 5 on signals
whose Fourier coefficients have varying magnitudes. These results did not differ substantively from those
on signals of the form (32), so we omit a detailed discussion.

[Figure 1: number line marked at $(b-1)p+h$, $(b-\tfrac12)p+h$, $\omega = bp+h$, $(b+\tfrac12)p+h$ and $(b+1)p+h$.]

Fig. 1. The rounding procedure is exact as long as the phase estimate $\tilde\omega$ is within $p/2$ of the correct multiple of $p$ (blue region in figure).
(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


3.  Rounding:  a  minor  modification  of  noiseless  algorithm 


A simple modification to the noiseless algorithm of [1] for the noisy case is to increase the sample
lengths $p$. By choosing $p$ large enough, the error from noise can be mitigated to be within a given
tolerance. The modification can be viewed simply as rounding, and we include it here both as a more direct
and simple-to-implement extension as well as for comparison purposes. When the noise level is low, this
modification yields reasonably good results.

As in the noiseless case we choose the shift $\varepsilon > 0$ so that $\varepsilon \le \frac{1}{N}$. In the noiseless case $\varepsilon = \frac{1}{N}$ would be
sufficient to avoid wrap-around aliasing in the phase reconstruction. Due to the presence of noise we will
need to make $\varepsilon$ slightly smaller because of (28). Let us analyze the recovery of a candidate frequency $\omega \in \Omega$
if we simply carry out the same process as in the noiseless environment.

First we choose a length $p$. Assume that $\omega \in \Omega$ does not collide with any other $\omega' \in \Omega$ modulo $p$. Let
$h \equiv \omega \pmod p$. The reconstruction of $\omega$ utilizes two factors. First, the locations of peaks in the DFT are
robust to noise: even with a relatively high noise level we may take $h \equiv \omega \pmod p$ to be exact. Second, by (28)
the frequency reconstruction from noisy measurements is correct up to an error term of size $O\!\left(\frac{\sigma}{2\pi\varepsilon a_{\min}\sqrt{p}}\right)$.
By combining these two measures we can more reliably estimate $\omega$.

Our proposed modification is to simply round the noisy frequency estimate

$$\tilde\omega = \frac{1}{2\pi\varepsilon}\,\mathrm{Arg}\!\left(\frac{\hat S^n_{p,\varepsilon}[h]}{\hat S^n_p[h]}\right) \tag{33}$$

to the nearest integer of the form $np + h$. This improved estimate is therefore given by

$$\omega' := p\cdot\mathrm{round}\!\left(\frac{\tilde\omega - h}{p}\right) + h, \tag{34}$$

where $\mathrm{round}(x)$ returns the nearest integer to $x$. For low noise levels this modification will return the true
value $\omega$, while for larger noise levels it is possible that $\tilde\omega$ deviates by more than $p/2$ from the true frequency $\omega$.
In this case the estimate $\omega'$ will be wrong by a multiple of $p$. Larger values of $p$ will reduce the likelihood
of an error in frequency estimation. See Fig. 1 for an illustration of this rounding procedure.
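The rounding step (34) is short enough to write out directly. The sketch below uses a hypothetical helper name `round_to_residue` and made-up numbers; it shows one estimate that lies within $p/2$ of the true frequency and one that does not.

```python
import math


def round_to_residue(omega_est, h, p):
    """Return the integer of the form n*p + h closest to omega_est, as in (34)."""
    # floor(x + 0.5) rounds to the nearest integer without Python's
    # round-half-to-even behaviour.
    n = math.floor((omega_est - h) / p + 0.5)
    return n * p + h


p = 101
omega = 1234
h = omega % p  # residue known from the peak location in the DFT

good = round_to_residue(1250.7, h, p)  # error 16.7 < p/2
bad = round_to_residue(1290.0, h, p)   # error 56 > p/2
print(good, bad)
```

The first call recovers the true frequency exactly, while the second lands on the neighbouring candidate $\omega + p$, which is the failure mode described above.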

To ensure that the estimated frequencies are sufficiently far from the branch cut of $\mathrm{Arg}(z)$ along the
negative real axis, we take the shift $\varepsilon \le \frac{1}{2N}$. The estimated frequencies then satisfy $-N \le \tilde\omega < N$, while
the true frequencies lie in the smaller interval $[-N/2, N/2)$. It is thus extremely unlikely that the deviations
due to the noise will push the estimates across the discontinuity.

We saw in the previous section that the error in the phase estimation is on the order of $\sigma/(a_{\min}\sqrt{p})$
when using the reconstruction formula (10). When using the rounding procedure (34), however, we should
expect accurate results for a wider range of sample lengths $p$ and noise levels $\sigma$. Indeed, note that the
rounded frequency estimate $\omega'$ is exact as long as

$$|\omega - \tilde\omega| < \frac{p}{2}. \tag{35}$$

[Figure 2: two panels, "mean phase error without rounding" (left) and "mean phase error with rounding" (right); horizontal axis $\log_2(\sigma)$.]

Fig. 2. (Left) Mean phase error (in log scale) for frequency estimation via (10). (Right) Mean phase error (in log scale) for frequency
estimation with rounding via (34). The red dashed line marks the transition to exact recovery when $p > (2\sigma/\varepsilon)^{2/3}$.


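Recall from Section 2.2 that the error of the frequency estimate $\tilde\omega$ is on the order of $O\!\left(\frac{\sigma}{\varepsilon a_{\min}\sqrt{p}}\right)$. Let us
assume that it is bounded by $C\frac{\sigma}{\varepsilon a_{\min}\sqrt{p}}$ for some constant $C$. Combining this with the requirement (35) we
see that the rounded frequency estimate $\omega'$ will be exact provided

$$C\,\frac{\sigma}{\varepsilon\,a_{\min}\,p^{3/2}} < \frac{1}{2}. \tag{36}$$

It follows that we get exact reconstruction if $p > \left(\frac{2C\sigma}{\varepsilon a_{\min}}\right)^{2/3}$.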

To illustrate this relationship, we generated 1000 test signals with frequencies chosen uniformly at random
from $[-N/2, N/2)$ and set the corresponding coefficient to unity. Thus our test signals for this empirical trial
were one-term trigonometric polynomials. For this test we took $N = 2^{22}$, $\varepsilon = \frac{1}{2N}$ and investigated a range
of parameters $(\sigma, p)$. We reconstructed the frequencies in two ways: first, simply using the formula (10), and
second by combining this estimate with the rounding procedure (34). In Fig. 2 we plot the average phase
error in logarithmic scale as a function of both $\sigma$ and $p$, which were varied from $2.5 \times 10^{-5}$ to $0.4096$ and
from 10 to 163 840, respectively, by powers of two.

In the plot on the left, which corresponds to reconstruction using only (10), we can clearly see the contours
of constant phase error obeying the relationship $\log_2(p) = 2\log_2(\sigma) + \alpha$ for various $\alpha$. This confirms our
analytic estimate from Section 2.2 that the phase error is proportional to $\sigma/(a_{\min}\sqrt{p})$. In the plot on the
right, which corresponds to the improved reconstruction using (34), we can see that for large values of $\sigma$ and
small values of $p$ the same relationship holds. However, for smaller $\sigma$ and larger $p$ we see an abrupt transition
to exact reconstruction (the white area in the upper-left). The boundary of this region (red dashed line)
follows the relationship $\log_2(p) = \frac{2}{3}\log_2(\sigma) + 16$, corresponding to $C = 1$ in (36) above. This confirms that
for small enough values of the ratio $\sigma/p^{3/2}$ the rounding procedure is exact.
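The offset 16 in the boundary relation follows from (36) with $C = 1$, $a_{\min} = 1$ and $\varepsilon = 1/(2N) = 2^{-23}$, since then $p = (2\sigma/\varepsilon)^{2/3} = (2^{24}\sigma)^{2/3}$. A quick numerical check, with a hypothetical helper name:

```python
import math


def length_threshold_for_exact_rounding(sigma, eps, a_min=1.0, C=1.0):
    """Value of p at which C*sigma/(eps*a_min*p**1.5) equals 1/2, from (36)."""
    return (2.0 * C * sigma / (eps * a_min)) ** (2.0 / 3.0)


N = 2 ** 22
eps = 1.0 / (2 * N)

offsets = []
for sigma in (2.0 ** -10, 2.0 ** -4, 1.0):
    p_star = length_threshold_for_exact_rounding(sigma, eps)
    # Offset in the relation log2(p) = (2/3) * log2(sigma) + offset.
    offsets.append(math.log2(p_star) - (2.0 / 3.0) * math.log2(sigma))

print(offsets)  # each entry should be 16 up to rounding error
```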


3.1.  Algorithm 

Our first algorithm for noisy signals is only a slight modification of the noiseless algorithm presented in
[1, Algorithm 1]. Considering (36), we change the lower bound

$$p > c_1 k \tag{37}$$

to

$$p > \max\left\{c_1 k,\; c_2\left(\frac{\sigma}{\varepsilon\,a_{\min}}\right)^{2/3}\right\}, \tag{38}$$

where $c_1$, $c_2$ are constants. In this way we ensure that the choice of $p$ is always large enough to isolate most
of the $k$ frequencies on average as well as being large enough to ensure that the rounding procedure (34) is
exact. In all of our experiments in Section 5 below we took $c_2 = 4$.

4.  A  multiscale  algorithm 

In Section 3 we saw that increasing $p$ sufficed to ensure that the rounding procedure was exact. While
this gives good results in terms of accuracy, the increased runtime associated with larger noise levels is
undesirable. The main contribution of this paper is a multiscale algorithm for recovering the frequency set
$\Omega$ of the signal $S(t)$. This algorithm achieves similar accuracy while providing an improvement of several
orders of magnitude in computational efficiency.

The key feature of this multiscale algorithm is the employment of multiple shifts $\varepsilon_j$, which enable us
to improve the accuracy of the phase estimations progressively without the need to significantly increase
the sample length $p$. As we will see, taking successively larger shifts enables a form of error-correction
in our frequency estimates at finer and finer scales, in essence "zooming in" on the true frequencies in a
multiscale fashion. The idea of progressively learning finer scale approximations to significant frequencies
has appeared in prior works addressing the sparse Fourier transform problem, including [7-9,13,18,16,17],
but the connection to $\beta$-encoders is to the best of our knowledge novel.

In  Section  4.1  we  give  some  background  on  our  multiscale  method  and  introduce  the  main  idea  of  our 
algorithm.  In  Section  4.2  we  prove  that  our  multiscale  approximations  are  accurate  estimates  of  the  true 
frequencies,  and  in  Section  4.3  we  describe  the  basic  multiscale  algorithm,  discuss  several  implementation 
details,  and  present  our  main  Theorem  (4.5). 

4.1. Multiscale frequency estimation

The main idea for the multiscale algorithm is that a value can be estimated with high precision with an
inaccurate (coarse) estimator applied progressively at different scales, much like in analog-to-digital
conversion where a signal value can be estimated with very high precision by the very coarse binary quantization.
In our sparse Fourier recovery algorithm, the coarse estimator is the approximation formula given by (26),

$$\frac{1}{2\pi}\,\mathrm{Arg}\!\left(\frac{\hat S^n_{p,\varepsilon}[h]}{\hat S^n_p[h]}\right) \approx \varepsilon\omega, \tag{39}$$

where $\approx$ is measured by the Lee norm $\|\cdot\|_{\mathbb{Z}}$.

For simplicity let us assume for the moment that our signal contains a single frequency $\omega$ with non-zero
Fourier coefficient. For a fixed $p$, let $\tilde\omega$ be our estimate for $\omega$ using the rounding procedure from Section 3
with shift $\varepsilon_0 \le \frac{1}{2N}$. Then we have

$$\tilde\omega \equiv \omega \pmod p, \tag{40}$$

although in general $\tilde\omega$ may differ from $\omega$ by a multiple of $p$.

[Figure 3: three rows labelled "iteration 1", "iteration 2", ..., "iteration m", each showing the digits of a candidate frequency.]

Fig. 3. Diagram of the multiscale frequency estimation procedure, with a candidate frequency pictured as a string of digits, from
most significant on the left to least significant on the right. In this figure, blue regions represent correct digits learned by the
algorithm, and orange regions represent digits where errors are likely. In the first iteration, the most significant bits are learned
using shift $\varepsilon_0$. Subsequent iterations give corrections at finer scales $\varepsilon_1^{-1}, \ldots, \varepsilon_m^{-1}$. (For interpretation of the references to color in
this figure legend, the reader is referred to the web version of this article.)

Suppose now that we repeat the computation of $\tilde\omega$ using a larger shift $\varepsilon_1 > \varepsilon_0$; that is, we sample our
signal at time points $\frac{j}{p} + \varepsilon_1$, take the FFT, and compute

$$b_1 := \frac{1}{2\pi}\,\mathrm{Arg}\!\left(\frac{\hat S^n_{p,\varepsilon_1}[h]}{\hat S^n_p[h]}\right) \tag{41}$$

(note that we do not divide by $\varepsilon_1$ here). Since in general $\varepsilon_1 > \frac{1}{N}$ we cannot take $b_1/\varepsilon_1$ as an estimate for $\omega$,
although it still holds that


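$$b_1 \approx \varepsilon_1\omega \pmod{[-\tfrac12, \tfrac12)}, \tag{42}$$

where $x \pmod{[-\tfrac12, \tfrac12)}$ is the unique value $y$ in $[-\tfrac12, \tfrac12)$ such that $x \equiv y \pmod 1$. We can use this fact to
estimate the error $\omega - \tilde\omega$ as follows. Note that

$$\varepsilon_1(\omega - \tilde\omega) = \varepsilon_1\omega - \varepsilon_1\tilde\omega \approx (b_1 - \varepsilon_1\tilde\omega) \pmod{[-\tfrac12, \tfrac12)}, \tag{43}$$

so that

$$\omega - \tilde\omega \approx \frac{(b_1 - \varepsilon_1\tilde\omega) \pmod{[-\tfrac12, \tfrac12)}}{\varepsilon_1}. \tag{44}$$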

This estimate of the error is not exact, since there is still noise that can perturb the calculated value $b_1$
from the true value $\varepsilon_1\omega \pmod{[-\tfrac12, \tfrac12)}$. However, analogously to (28) we have

$$\left|(\omega - \tilde\omega) - \frac{(b_1 - \varepsilon_1\tilde\omega) \pmod{[-\tfrac12, \tfrac12)}}{\varepsilon_1}\right| = O\!\left(\frac{\sigma}{\varepsilon_1 a_{\min}\sqrt{p}}\right), \tag{45}$$

which immediately implies that the updated estimate satisfies

$$\left|\omega - \left(\tilde\omega + \frac{(b_1 - \varepsilon_1\tilde\omega) \pmod{[-\tfrac12, \tfrac12)}}{\varepsilon_1}\right)\right| = O\!\left(\frac{\sigma}{\varepsilon_1 a_{\min}\sqrt{p}}\right). \tag{46}$$

Since $\varepsilon_1 > \varepsilon_0$, adding the correction term (44) to our previous estimate $\tilde\omega$ will give a finer approximation
to the true frequency $\omega$. By iterating this error correction process with progressively larger shifts $\varepsilon_j$, we
obtain an algorithm which adaptively corrects for the error in a multiscale fashion. See Fig. 3 for a diagram
of the multiscale estimation procedure. In the next section we provide a detailed analysis of this multiscale
approximation scheme, and prove that the frequency estimates it produces are accurate.


4.2. Analysis of multiscale approximations

We begin with a technical lemma relating arithmetic in the Lee norm to that on the interval $[-\tfrac12, \tfrac12)$.
It will be used repeatedly in the sequel.

Lemma 4.1. Let $\delta > 0$ and $x \in [-\tfrac12 + \delta, \tfrac12 - \delta]$. Assume that $\|x - b\|_{\mathbb{Z}} < \delta$ and $b \in [-\tfrac12, \tfrac12)$. Then $|x - b| < \delta$.

Proof. Let $r = \|x - b\|_{\mathbb{Z}}$. Then $x - b = \pm r + k$ for some $k \in \mathbb{Z}$. If $k = 0$ we have

$$|x - b| = \|x - b\|_{\mathbb{Z}} < \delta \tag{47}$$

by hypothesis, so the claim holds. Now assume $k \ne 0$. Note that

$$|x - b| \le |x| + |b| \le 1 - \delta \tag{48}$$

by the triangle inequality and the assumptions on $x$ and $b$. At the same time, we have

$$|\pm r + k| \ge 1 - r > 1 - \delta. \tag{49}$$

This is a contradiction, since (48) and (49) cannot hold simultaneously. Thus we must in fact have $k = 0$,
and the claim holds. □

The  following  theorem  formalizes  the  multiscale  frequency  estimation  procedure  which  was  introduced 
in  the  previous  section. 

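Theorem 4.2. Let $\omega \in [-\tfrac{N}{2}, \tfrac{N}{2})$. Let $0 < \varepsilon_0 < \varepsilon_1 < \cdots < \varepsilon_m$ and $b_0, b_1, \ldots, b_m \in \mathbb{R}$ such that

$$\|\varepsilon_j\omega - b_j\|_{\mathbb{Z}} < \delta, \quad 0 \le j \le m \tag{50}$$

where $0 < \delta < \tfrac14$. Assume that $\varepsilon_0 \le \frac{1-2\delta}{N}$, and $\beta_j := \varepsilon_j/\varepsilon_{j-1} \le (1 - 2\delta)/(2\delta)$. Then there exist $c_0, c_1, \ldots, c_m \in \mathbb{R}$,
each computable from $\{\varepsilon_j\}$ and $\{b_j\}$, such that

$$|\tilde\omega - \omega| \le \frac{\delta}{\varepsilon_0}\prod_{j=1}^{m}\beta_j^{-1}, \quad\text{where}\quad \tilde\omega := \sum_{j=0}^{m}\frac{c_j}{\varepsilon_j}. \tag{51}$$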

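Proof. Denote $\omega_0 := \omega$. We first note that

$$|\varepsilon_0\omega_0| \le \varepsilon_0\frac{N}{2} \le \frac12 - \delta \tag{52}$$

where the second inequality follows from the assumptions of the theorem. Let $c_0 = b_0 \pmod{[-\tfrac12, \tfrac12)}$, so
that $|\varepsilon_0\omega_0 - c_0| < \delta$ by Lemma 4.1. Let $\lambda_0 = c_0/\varepsilon_0$, which represents a coarse estimate of $\omega_0$ with the error
bound

$$|\lambda_0 - \omega_0| < \delta/\varepsilon_0. \tag{53}$$

Next, let $\omega_1 = \omega_0 - \lambda_0$. By the above $|\omega_1| < \delta/\varepsilon_0$ and

$$|\varepsilon_1\omega_1| < \frac{\varepsilon_1\delta}{\varepsilon_0} = \beta_1\delta \le \frac12 - \delta. \tag{54}$$

We then have

$$\|\varepsilon_1\omega - b_1\|_{\mathbb{Z}} = \|\varepsilon_1\omega_1 - (b_1 - \varepsilon_1\lambda_0)\|_{\mathbb{Z}} < \delta. \tag{55}$$

Set $c_1 = (b_1 - \varepsilon_1\lambda_0) \pmod{[-\tfrac12, \tfrac12)}$. It follows from Lemma 4.1 again that $|\varepsilon_1\omega_1 - c_1| < \delta$. We set $\lambda_1 = c_1/\varepsilon_1$.

We can recursively define $c_j$, $\lambda_j$ and $\omega_j$ for all $1 \le j \le m$. In general we define $\omega_j := \omega_{j-1} - \lambda_{j-1}$, so that
$\omega_j = \omega - \sum_{i=0}^{j-1}\lambda_i$. This leads to

$$|\varepsilon_j\omega_j| < \frac{\varepsilon_j\delta}{\varepsilon_{j-1}} = \beta_j\delta \le \frac12 - \delta. \tag{56}$$

Set

$$c_j = \Bigl(b_j - \varepsilon_j\sum_{i=0}^{j-1}\lambda_i\Bigr) \pmod{[-\tfrac12, \tfrac12)}, \tag{57}$$

which yields

$$\|\varepsilon_j\omega_j - c_j\|_{\mathbb{Z}} = \|\varepsilon_j\omega - b_j\|_{\mathbb{Z}} < \delta. \tag{58}$$

Lemma 4.1 now gives $|\varepsilon_j\omega_j - c_j| < \delta$. Set $\lambda_j = c_j/\varepsilon_j$.

Finally denote $\omega_{m+1} = \omega_m - \lambda_m$. It is straightforward now to verify that

$$\omega = \omega_0 = \sum_{j=0}^{m}\lambda_j + \omega_{m+1} = \sum_{j=0}^{m}\frac{c_j}{\varepsilon_j} + \omega_{m+1}. \tag{59}$$

Furthermore, by construction $\omega_{m+1} = \omega_m - \lambda_m$, which has $|\omega_{m+1}| < \delta/\varepsilon_m$. By hypothesis $\varepsilon_m = \varepsilon_0\prod_{j=1}^{m}\beta_j$,
yielding

$$|\omega_{m+1}| < \frac{\delta}{\varepsilon_0}\prod_{j=1}^{m}\beta_j^{-1} \tag{60}$$

and completing the proof. □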

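Remark 4.1. From the proof of Theorem 4.2 the values $c_j$ and $\tilde\omega$ are explicitly computable through the
recursive formula $\omega_0 = \omega$, $c_0 = b_0 \pmod{[-\tfrac12, \tfrac12)}$, $\lambda_0 = c_0/\varepsilon_0$ and

$$\begin{cases}
\omega_j = \omega_{j-1} - \lambda_{j-1} \\
c_j = \bigl(b_j - \varepsilon_j\sum_{i=0}^{j-1}\lambda_i\bigr) \pmod{[-\tfrac12, \tfrac12)} \\
\lambda_j = c_j/\varepsilon_j
\end{cases} \tag{61}$$

for $1 \le j \le m$. Equivalently, we can write the updated frequency estimates along the lines of (46) as

$$\tilde\omega_0 = b_0/\varepsilon_0, \qquad \tilde\omega_{n+1} = \tilde\omega_n + \frac{(b_{n+1} - \varepsilon_{n+1}\tilde\omega_n) \pmod{[-\tfrac12, \tfrac12)}}{\varepsilon_{n+1}}. \tag{62}$$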
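The recursion of Remark 4.1 can be traced on synthetic phase measurements. The sketch below is an illustration under stated assumptions: the measurement errors are a deterministic stand-in for noise (size 0.09, alternating sign), the parameter values are chosen so that the hypotheses of Theorem 4.2 hold with $\delta = 0.1$, and each update uses the measurement $b_j$ at the new shift $\varepsilon_j$ together with the running estimate, as in Algorithm 1. The helper names are hypothetical.

```python
import math


def centered_mod(x):
    """Return the representative of x modulo 1 in [-1/2, 1/2)."""
    return x - math.floor(x + 0.5)


def multiscale_estimate(b, eps):
    """Combine phase measurements b[j] ~ eps[j]*omega (mod 1) into one estimate."""
    est = centered_mod(b[0]) / eps[0]
    for b_j, eps_j in zip(b[1:], eps[1:]):
        # Correct the running estimate at the finer scale 1/eps_j.
        est += centered_mod(b_j - eps_j * est) / eps_j
    return est


N = 2 ** 22
omega = 1234567                       # true frequency, inside [-N/2, N/2)
beta, delta, m = 2.0, 0.1, 20         # beta <= (1 - 2*delta) / (2*delta) = 4
eps = [beta ** j / (2 * N) for j in range(m + 1)]

# Synthetic measurements with Lee-norm error 0.09 < delta.
b = [centered_mod(e * omega + 0.09 * (-1) ** j) for j, e in enumerate(eps)]

est = multiscale_estimate(b, eps)
print(abs(est - omega), delta / eps[-1])  # error and the bound delta / eps_m
```

Each pass shrinks the error bound by the factor $\beta$, so after $m$ corrections the error is controlled by $\delta/\varepsilon_m$ rather than $\delta/\varepsilon_0$.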


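Corollary 4.3. Assume that in the above theorem we have $\beta_j = \beta$ where $\beta \le (1 - 2\delta)/(2\delta)$, i.e. $\varepsilon_j = \beta^j\varepsilon_0$
for all $j$. Let $p > 0$ and $m \ge \left\lfloor\log_\beta\frac{2\delta}{p\,\varepsilon_0}\right\rfloor + 1$. Then

$$|\tilde\omega - \omega| \le \frac{\delta}{\varepsilon_0}\beta^{-m} < \frac{p}{2}. \tag{63}$$

Proof. This is a straightforward corollary. By Theorem 4.2 we have

$$|\tilde\omega - \omega| \le \frac{\delta}{\varepsilon_0}\prod_{j=1}^{m}\beta_j^{-1} = \frac{\delta}{\varepsilon_0}\beta^{-m}. \tag{64}$$

It is easy to check that $m = \left\lfloor\log_\beta\frac{2\delta}{p\,\varepsilon_0}\right\rfloor + 1$ is the smallest integer such that $\frac{\delta}{\varepsilon_0}\beta^{-m} < \frac{p}{2}$. □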
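The number of shifts in the corollary can be computed either by direct search or from the logarithm. The sketch below does both for one illustrative parameter set ($\beta = 2$, $\delta = 1/(2\beta+2)$, $\varepsilon_0 = 1/(2N)$, $p = 1000$); the closed form is the one implied by the minimality statement in the proof, and the function name is hypothetical.

```python
import math


def num_shifts(delta, eps0, beta, p):
    """Smallest m with (delta/eps0) * beta**(-m) < p/2, found by direct search."""
    m = 0
    while (delta / eps0) * beta ** (-m) >= p / 2:
        m += 1
    return m


N = 2 ** 22
eps0 = 1 / (2 * N)
beta = 2.0
delta = 1 / (2 * beta + 2)
p = 1000

m = num_shifts(delta, eps0, beta, p)
# Closed form; valid here because the argument of the logarithm exceeds 1.
closed_form = math.floor(math.log(2 * delta / (p * eps0), beta)) + 1
print(m, closed_form)
```

The count grows only logarithmically in $N/p$, which is what keeps the multiscale algorithm's cost close to that of the noiseless method.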


Note that as mentioned in Section 3, even with noise the value $\omega \pmod p$ can be accurately computed
very reliably. Thus if the difference $|\tilde\omega - \omega|$ is smaller than $\frac{p}{2}$ then $\omega$ can be recovered exactly by taking the
closest integer to $\tilde\omega$ with the same residue modulo $p$.

In numerical tests we choose uniform $\beta_j = \beta$. While making $\beta$ as large as it can be for a given error
estimate $\delta$ will undoubtedly reduce the computational cost, there is nevertheless a good reason that we
should not be too "greedy" and be more conservative by choosing a smaller $\beta > 1$. The reason is that given
the random nature of the noise the error bound $\delta$ is only in the average sense. To minimize reconstruction
errors we should try to provide as much latitude as possible for the uncertainties associated with the error
estimate $\delta$. Hence it is useful to ask how much latitude one gets for given choices of $\varepsilon_0$ and $\beta$.


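Theorem 4.4. Let $\omega \in [-\tfrac{N}{2}, \tfrac{N}{2})$, $\varepsilon_0 > 0$ and $\beta > 1$. Set $\varepsilon_j = \beta^j\varepsilon_0$ for $1 \le j \le m$. Assume that we have
$b_0, b_1, \ldots, b_m \in \mathbb{R}$ such that

$$\|\varepsilon_j\omega - b_j\|_{\mathbb{Z}} < \delta, \quad 0 \le j \le m \tag{65}$$

where

$$\delta = \min\left(\frac{1 - \varepsilon_0 N}{2},\; \frac{1}{2\beta + 2}\right). \tag{66}$$

Then the estimate $\tilde\omega$ of $\omega$ given by $\tilde\omega := \sum_{j=0}^{m}\frac{c_j}{\varepsilon_j}$ satisfies

$$|\tilde\omega - \omega| \le \frac{\delta}{\varepsilon_0}\beta^{-m}, \tag{67}$$

where $c_j$ are given in (61).

Proof. The proof is straightforward. Note that Theorem 4.2 holds under the conditions $\varepsilon_0 \le \frac{1-2\delta}{N}$ and
$\beta_j \le \frac{1-2\delta}{2\delta}$. These two conditions are equivalent to the condition $\delta \le \min\left(\frac{1-\varepsilon_0 N}{2}, \frac{1}{2\beta+2}\right)$. Clearly,
$\delta = \min\left(\frac{1-\varepsilon_0 N}{2}, \frac{1}{2\beta+2}\right)$ is the largest admissible value for $\delta$. □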

4.3. Algorithm

In this section we provide some details of our implementation of the multiscale frequency estimation
procedure described in Section 4.1. In particular, we discuss the choice of various parameters necessary for
reconstruction according to Theorem 4.2 as well as changes made to the aliasing detection test from [1]
to improve robustness in the presence of noise. We give pseudocode for the iterative frequency estimation
procedure in Algorithm 1; the full algorithm is given by replacing lines 6-22 in [1, Algorithm 1] with this
procedure. We also present our main Theorem (4.5) stating the correctness and average and worst case
runtime and sampling complexity of the multiscale algorithm.

Algorithm 1 MultiscaleFreqEst.

Input: $S(t)$, $N$, $k$, $\beta$, $\sigma$, $a_{\min}$, $c_\sigma$, $\tau$
Output: $\{\tilde\omega_\ell\}_{\ell=1}^{k}$

$p \leftarrow \max\left\{c_1 k,\ \left(\beta(\beta+1)\,a_{\min}^{-1}c_\sigma\sigma/\pi\right)^2\right\}$
$m \leftarrow 1 + \lfloor\log_\beta(N/p)\rfloor$
$\mathrm{vote}_\ell \leftarrow 0$, $\ell = 1, \ldots, k$
$\hat S_p \leftarrow$ FFT of $p$-samples of $S(t)$
for $j = 0$ to $m$ do
    $\varepsilon_j \leftarrow \beta^j/(2N)$
    $\hat S_{p,\varepsilon_j} \leftarrow$ FFT of $\varepsilon_j$-shifted $p$-samples of $S(t)$
    for $\ell = 1$ to $k$ do
        $h \leftarrow$ index of $\ell$th largest peak in $\hat S_p$
        $r \leftarrow \bigl|\,|\hat S_{p,\varepsilon_j}[h]| - |\hat S_p[h]|\,\bigr| \,/\, |\hat S_p[h]|$
        if $r > \tau$ then
            $\mathrm{vote}_\ell \leftarrow \mathrm{vote}_\ell + 1$
        end if
        $b_j \leftarrow \frac{1}{2\pi}\mathrm{Arg}\bigl(\hat S_{p,\varepsilon_j}[h]/\hat S_p[h]\bigr)$
        if $j = 0$ then
            $\tilde\omega_\ell \leftarrow b_j/\varepsilon_j$
        else
            $\tilde\omega_\ell \leftarrow \tilde\omega_\ell + (b_j - \varepsilon_j\tilde\omega_\ell) \pmod{[-\tfrac12, \tfrac12)}/\varepsilon_j$
        end if
        if $j = m$ then
            $\tilde\omega_\ell \leftarrow p\cdot\mathrm{round}\bigl((\tilde\omega_\ell - h)/p\bigr) + h$
        end if
    end for
end for
return $\tilde\omega_\ell$ with $\mathrm{vote}_\ell \le \eta(m + 1)$


4.3.1. Choice of p

It remains to determine the choice of sampling length $p$, given the parameter $\beta$ and the noise level $\sigma$.
Recall from the proof of Theorem 4.2 that the estimated frequency $\tilde\omega$ is given by the sum $\sum_{j=0}^{m}\lambda_j$ where
$\lambda_j = c_j/\varepsilon_j$. Moreover, the difference between successive frequency approximations is given in terms of $\lambda_j$ as

$$\omega_j := \omega_{j-1} - \lambda_{j-1} \;\Longrightarrow\; \lambda_j = \omega_j - \omega_{j+1}. \tag{68}$$

Thus we can decompose the error of approximation at stage $j + 1$ as

$$|\omega - \omega_{j+1}| = |(\omega_j - \omega_{j+1}) - (\omega_j - \omega)| = |\lambda_j - (\omega_j - \omega)|. \tag{69}$$

By Theorem 4.2 the left-hand side of (69) satisfies

$$|\omega - \omega_{j+1}| \le \frac{\delta}{\varepsilon_{j+1}}, \tag{70}$$

while analogously to (28) the right-hand side of (69) satisfies

$$|\lambda_j - (\omega_j - \omega)| \le O\!\left(\frac{\sigma}{2\pi\varepsilon_j a_{\min}\sqrt{p}}\right). \tag{71}$$

Denoting by $c_\sigma$ the constant in the right-hand side above and equating the two upper bounds gives

$$\frac{2\pi\delta\,a_{\min}\sqrt{p}}{c_\sigma\sigma} = \frac{\varepsilon_{j+1}}{\varepsilon_j} =: \beta. \tag{72}$$


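Under the assumptions of Theorem 4.4, we have

$$\delta = \min\left(\frac{1 - \varepsilon_0 N}{2},\; \frac{1}{2\beta + 2}\right). \tag{73}$$

Since we take $\varepsilon_0 = \frac{1}{2N}$ and $\beta > 1$, the latter term is necessarily the smaller. Plugging this into (72)
above and rearranging to solve for $p$ gives

$$p = \left(\frac{\beta(\beta + 1)\,a_{\min}^{-1}c_\sigma\sigma}{\pi}\right)^2. \tag{74}$$

As in the rounding algorithm, we require in addition that $p > c_1 k$, so the sample lengths for the multiscale
algorithm are chosen to satisfy

$$p > \max\left\{c_1 k,\; \left(\frac{\beta(\beta + 1)\,a_{\min}^{-1}c_\sigma\sigma}{\pi}\right)^2\right\}. \tag{75}$$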
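The step from (72) to (74) can be spelled out as follows. This is only the algebra implied by the text, using $\delta = \frac{1}{2\beta+2}$ as stated above; no new assumptions are introduced.

```latex
\frac{2\pi\delta\,a_{\min}\sqrt{p}}{c_\sigma\sigma} = \beta,
\qquad \delta = \frac{1}{2\beta+2}
\;\Longrightarrow\;
\frac{\pi\,a_{\min}\sqrt{p}}{(\beta+1)\,c_\sigma\sigma} = \beta
\;\Longrightarrow\;
\sqrt{p} = \frac{\beta(\beta+1)\,c_\sigma\sigma}{\pi\,a_{\min}}
\;\Longrightarrow\;
p = \left(\frac{\beta(\beta+1)\,c_\sigma\sigma}{\pi\,a_{\min}}\right)^{2}.
```

The required sample length therefore grows quadratically in the noise-to-signal ratio $\sigma/a_{\min}$ and roughly like $\beta^4$ in the scale factor, which is one more reason to prefer a moderate $\beta$.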


4.3.2. Robust aliasing test

As noted in Section 2.1, our frequency estimation procedure works only for non-collision $\omega$. In [1] two tests
were given to determine whether a collision had occurred at a candidate frequency. In the implementation
of that algorithm in the noiseless setting, requiring the ratio (11) to be within some threshold of unity
sufficed to detect collisions. In the setting of the current paper, where the samples are corrupted with noise,
we resort to the second of the tests given in [1], which examines the ratios (11) for several values of $\varepsilon$. For
$0 \le j \le m$ we compute the ratio (11) and compare it with a threshold $\tau$. We count the number of times
the ratio exceeds $\tau$ and reject those frequencies which fail more than an $\eta$ fraction of the tests. Since we
expect fluctuations in this ratio due to noise of order $\frac{\sigma}{a_{\min}\sqrt{p}}$, we set $\tau$ to be a small constant multiple of this
quantity.


4.3.3. Number of iterations

Recall from Corollary 4.3 that, for constant $\beta_j = \beta$, $m = \left\lfloor\log_\beta\frac{2\delta}{p\,\varepsilon_0}\right\rfloor + 1$ shifts suffice to ensure that
the estimated frequency satisfies $|\tilde\omega - \omega| < \frac{p}{2}$. As in Section 3 we take $\varepsilon_0 = \frac{1}{2N}$ to avoid the branch
cut of $\mathrm{Arg}(z)$. Assume that the first term in (75) is the larger of the two, so that $p = O(k)$. Then after
$O(\log(N/k))$ iterations, by rounding the approximate frequency $\tilde\omega$ to the closest integer of the form $np + h$,
where $h \equiv \omega \pmod p$ is known from the location of the peak in $\hat S^n_p$, we will recover the true frequency $\omega$.
With the results of [1, Theorems 3-4] this immediately implies the following


Theorem 4.5. Let $S^n(t) = S(t) + n(t)$, where $\hat{S}(\omega)$ is $k$-sparse with integral frequencies satisfying $\omega \in \Omega \subset [-N/2, N/2)$ and $n$ is complex i.i.d. Gaussian noise of variance $\sigma^2$. Moreover, suppose that $k > C(\beta(\beta+1)\sigma/a_{\min})^2$ for a constant $C$. There is a deterministic algorithm that, given $N$, $k$, $\beta$ and access to $S^n(t)$, returns a list of $k$ pairs $(\tilde\omega, a_{\tilde\omega})$ such that (i) each $\tilde\omega \in \Omega$ and (ii) for each $\tilde\omega$, there is an $\omega \in \Omega$ such that $|a_\omega - a_{\tilde\omega}| < C'\sigma/\sqrt{k}$.

The average-case runtime and sampling complexity are

$O(k \log(k) \log(N/k))$ and $O(k \log(N/k))$,

respectively, over the class of signals in Section 2.4. The worst-case runtime and sampling complexities are

$O(k^2 \log(k) \log(N/k))$ and $O(k^2 \log(N/k))$,

respectively.
Fig. 4. (a) Average EMD(1) error of the algorithms as a function of noise level $\sigma$. (b) Average EMD($\omega$) error as a function of noise level $\sigma$. Due to the log scale on the y axis, all EMD($\omega$) values have been shifted up by $10^{-16}$ for clarity. Panel parameters: $k = 256$, $c_1 = 2$, $N = 2^{22}$.


Proof. Replacing lines 6-22 of [1, Algorithm 1] with the multiscale frequency estimation procedure of Algorithm 1 yields an algorithm with the stated runtime and sampling complexities. The termination criterion of [1, Algorithm 1] ensures that $k$ frequencies are returned, and the previous analysis in this section ensures that the returned frequencies are correct. The coefficient estimates $a_{\tilde\omega}$ satisfy (23), which together with the assumption on $k$ yields the claim. □
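The final rounding step used in Section 4.3.3 (snapping an approximate frequency to the closest integer of the form $np + h$) can be illustrated with a short sketch. The function name `round_to_residue` and the sample values are hypothetical; the only property relied upon is that the rounding recovers the true frequency whenever the estimate is within $p/2$ of it.

```python
def round_to_residue(omega_approx, p, h):
    """Closest integer to omega_approx that is congruent to h modulo p."""
    # Choose the integer n that makes n * p + h closest to the estimate.
    # Note: Python's round() resolves exact ties to the even integer.
    n = round((omega_approx - h) / p)
    return n * p + h


# Hypothetical example: a true integer frequency and a noisy estimate of it.
p = 101
omega_true = 1234567
h = omega_true % p                      # residue known from the peak location
recovered_hi = round_to_residue(omega_true + 37.4, p, h)
recovered_lo = round_to_residue(omega_true - 49.9, p, h)

# Negative frequencies work as well, since Python's % returns a value in [0, p).
omega_neg = -5000
recovered_neg = round_to_residue(omega_neg + 20.0, p, omega_neg % p)
```

Because admissible candidates are spaced $p$ apart, the estimate only needs to be accurate to within $p/2$, which is a far weaker requirement than resolving the frequency to the nearest integer directly.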

5.  Empirical  evaluation 


In this section we describe the results of an empirical evaluation of the algorithms of Sections 3 and 4. We focus on two aspects of the algorithms’ performance: accuracy as measured in the EMD(1) and EMD($\omega$) metrics (cf. Section 2.3), and runtime as a function of both the sparsity $k$ and the noise level $\sigma$. In all of the experiments reported below, we report averages over 100 random test signals generated according to the prescription in Section 2.4. The bandwidth for these tests was fixed at $N = 2^{22}$.

All experiments were conducted in C++ on a Linux machine with four Intel Xeon X5355 dual-core processors at 2.66 GHz and 64 GB of RAM. The GNU compiler was used with optimization flag -O3. For the multiscale algorithm, it was determined after extensive testing that the choice of parameters $c_1 = 2$,
ca  =  6,  r]  =  /3  =  2.5  gave  a  satisfactory  balance  between  runtime  and  accuracy.  All  FFTs  were  computed 

using  FFTW3  [3].  For  comparison,  we  also  present  the  results  of  the  same  trials  for  two  alternative  sparse 
Fourier  algorithms:  sFFT  1.0  [14]  and  AAFFT  [24]. 

5.1.  Accuracy 


In Fig. 4 (a) we plot the average EMD(1) error of the algorithms as a function of the noise level $\sigma$. For the rounding algorithm, the EMD(1) error increases as $\sigma^{2/3}$, while for the other three it increases linearly. In all cases the EMD(1) error is dominated by the coefficient error. The coefficient estimates in all four algorithms are given by an empirical average of the samples, and so the accuracy is determined by the number of samples taken. This explains both the scaling of the error of our rounding algorithm (recall from Section 3 that $p$ is bounded below by a quantity proportional to $\sigma^{2/3}$), as well as the larger EMD(1) error of our multiscale algorithm, which performs well even with $c_1$ as small as two. The multiscale error correction allows us to take much coarser sampling rates to achieve a tolerable error. As we show in the next subsection, these coarser sampling rates lead to much improved runtime.

Fig. 5. (a) Average runtime vs. sparsity $k$ for the algorithms tested. (b) Average runtime vs. noise level $\sigma$.

In order to assess the accuracy of the frequency lists returned by each of the four algorithms, in Fig. 4 (b) we plot the average EMD($\omega$) error as a function of the noise level. The EMD($\omega$) error was zero for all trials of the rounding algorithm, as expected due to the choice of $p$. Moreover, for all but the highest noise level, the EMD($\omega$) error of the multiscale algorithm was zero in all trials. For most values of $\sigma$, the EMD($\omega$) error of sFFT 1.0 was non-zero, indicating that even at low to moderate noise levels, erroneous frequencies are returned. The EMD($\omega$) error of AAFFT was always less than $1/N$, indicating that true frequencies were recovered in all cases; the non-zero values are numerical artifacts.


5.2.  Runtime 


In Fig. 5 (a) we plot the average runtime of the algorithms as a function of the sparsity $k$ for a fixed value of the noise level $\sigma = 0.512$ and the parameter $c_1 = 2$. As a reference for runtime comparisons, we also plot the time taken by FFTW3 on the same machine. For the rounding algorithm, we see that there is no dependence on $k$ until $k = 64$; this is a consequence of the requirement (38) on the choice of sampling rate. Thus at this noise level our modified algorithm is slightly slower than a highly optimized FFT implementation. The average runtime of our multiscale algorithm scales slightly superlinearly with $k$, which is expected given the runtime bound $O(k \log(k) \log(N/k))$ of Section 4.3.3. Moreover, we note that for all levels of sparsity tested, the multiscale algorithm outperforms AAFFT, sFFT 1.0, and FFTW3.

In Fig. 5 (b) we plot the average runtime of the algorithms as a function of the noise level $\sigma$ for a fixed value of the sparsity $k = 256$. For the rounding algorithm we can see the approximate dependence of the runtime on $\sigma^{2/3}$, as dictated by the choice of $p$ in (38). For the multiscale algorithm, there is no dependence on $\sigma$ until the very noisy case $\sigma = 0.512$.
Fig. 6. Average number of spurious frequencies inserted and deleted by the multiscale algorithm as a function of $k$ and $\sigma$.


5.3.  Spurious  frequencies

As  noted  in  Section  2,  due  to  noise  it  is  possible  that  one  or  more  spurious  frequencies  are  introduced 
into  our  signal  representation.  In  subsequent  iterations,  any  such  spurious  frequency  will  be  identified  and 
subtracted  from  the  updated  representation.  Since  this  happens  with  non-zero  probability,  it  is  of  interest 
to examine how often such an insertion and deletion occurs. In Fig. 6, we plot the average number of spurious frequencies inserted and deleted by the multiscale algorithm as a function of $k$ and $\sigma$. It is clear
that  the  inclusion  of  spurious  frequencies  only  occurs  in  the  high-noise,  high-sparsity  regime.  Moreover,  on 
average  only  one  such  wrong  frequency  appears  in  an  intermediate  representation  even  in  this  challenging 
environment.  This  indicates  that  our  robust  aliasing  test  of  Section  4.3.2  does  a  very  good  job  at  detecting 
collisions  in  all  but  the  most  extreme  circumstances. 


5.4.  Non-integer  frequency  estimation


We report here on an experiment to investigate the utility of our multiscale algorithm for the estimation of a single non-integer frequency. While an exhaustive study is beyond the scope of this paper, it is interesting to note that a minor modification of our multiscale frequency estimation algorithm performs quite well in practice. In addition, this setting provides another justification for use of the EMD metric for assessing the quality of the output of our algorithm, since there is no way to compare non-integer frequencies using a discrete $\ell^2$ norm. See also [20] for a brief discussion of the output evaluation metric for this problem. The question of estimating multiple non-integer (or off-grid) frequencies in noise is difficult, requiring more robust methods than those described here. Recent work addressing this question from the algorithms and optimization perspectives includes [20] and [21], respectively.

In  the  non-integer  frequency  case,  we  modify  Algorithm  1  to  omit  lines  20-22,  i.e.  we  do  not  round 
our  frequency  estimates  to  the  nearest  integer  with  the  same  remainder  as  h.  Our  frequency  estimates  are 
thus  not  necessarily  integers,  and  an  empirical  evaluation  shows  that  they  approximate  the  true  non-integer 
frequencies  quite  well,  even  in  the  presence  of  noise.  In  our  empirical  evaluation,  we  set  the  single  non-integer 
frequency  to  be  \/2 u,  where  u  is  uniform  on  —  T 2^2^ ’  and  set  the  corresponding  coefficient  to  be 

unity. In Fig. 7 we plot the average EMD($\omega$) error as a function of the noise level $\sigma$, averaged over 100 trials. It is clear from the figure that the EMD($\omega$) error scales linearly with the noise level, indicating the robustness of our estimation procedure for the single-frequency case. We do not attempt to explain this phenomenon here, and leave a detailed study of this important question as a topic for future work.

Fig. 7. Average EMD($\omega$) error vs. noise level $\sigma$ for a single non-integer frequency.


6.  Conclusion 

In  this  paper  we  gave  two  extensions  of  the  sparse  Fourier  algorithm  of  [1]  to  handle  noisy  signals.  The 
first  of  these  was  a  minor  modification  of  the  original  algorithm  that  involved  rounding  frequency  estimates 
to  the  nearest  integer  with  the  correct  residue  modulo  the  inverse  sampling  rate.  We  showed  that  in  order 
for this modification to correctly identify the true frequencies in Gaussian noise of standard deviation $\sigma$,
the  sampling  rate  needed  to  satisfy  p  >  (aminff) 2'3.  While  this  resulted  in  accurate  approximations  of  the 
Fourier transform in the EMD(1) and EMD($\omega$) metrics, the sampling rate requirement forced the algorithms
to  be  slow  in  practice. 

The second extension overcame this pitfall by introducing a multiscale approach to frequency estimation inspired by the literature on $\beta$-encoders in analog-to-digital conversion. By using samples of the input at multiple geometrically spaced time shifts, our algorithm exhibits a form of error correction in its frequency estimation. This allows the use of much coarser sampling rates than the first modification, which in turn leads to greatly reduced runtimes in our empirical evaluation. The error correction of our multiscale algorithm is similar to that of the $\beta$-encoders, and this connection is to the best of our knowledge novel in the sparse Fourier transform context.
Pipeline  Schwarz  Waveform  Relaxation


Benjamin  Ong1,  Scott  High2,  and  Felix  Kwok3 


Abstract To leverage the computational capability of modern supercomputers, existing algorithms need to be reformulated in a manner that allows for many concurrent operations. In this paper, we outline a framework that reformulates classical Schwarz waveform relaxation so that successive waveform iterates can be computed in a parallel pipeline fashion after an initial start-up cost. The communication costs for various implementations are discussed, and numerical scaling results are presented.

Key words: Schwarz waveform relaxation, pipeline parallelism, domain decomposition, distributed computing


1  Introduction 

Schwarz Waveform Relaxation (SWR), introduced in [2], has been analyzed for a wide range of time-dependent problems, including the parabolic heat equation [7], wave equation and advection-diffusion equations [6, 8], Maxwell’s equations [4], and the porous medium equation [9]. In contrast to classical Schwarz iterations, where the time-dependent PDE is discretized in time and domain-decomposition is applied to the sequence of steady-state problems, SWR solves time-dependent sub-problems; this relaxes synchronization of the sub-problems and provides a means to couple disparate solvers applied to individual sub-problems, for example [10]. SWR has also been shown in [8, 1] to have superlinear convergence for small time windows. This paper outlines a framework that reformulates SWR so that successive waveform iterates can be computed in a pipeline fashion, allowing for increased concurrency and hence, increased scalability for SWR-type algorithms. In §2, we review the SWR algorithm before introducing and comparing several Pipeline Schwarz Waveform Relaxation algorithms (PSWR) in §3. Numerical scaling results for the linear heat equation are presented in §4.
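2  Schwarz  Waveform  Relaxation


Denote the PDE of interest as

$$
\begin{aligned}
u_t &= \mathcal{L}(t,u), && (x,t) \in \Omega \times [0,T], \qquad (1)\\
u(x,0) &= f(x), && x \in \Omega,\\
u(z,t) &= g(z,t), && z \in \partial\Omega.
\end{aligned}
$$

Consider a partitioning of the domain, $\Omega = \bigcup_i \Omega_i$. The domains in the partition may be overlapping or non-overlapping. Let $u_i$ denote the solution on sub-domain $\Omega_i$. Then, equation (1) can be decomposed into a coupled system of equations,

$$
\begin{aligned}
(u_i)_t &= \mathcal{L}(t,u_i), && (x,t) \in \Omega_i \times [0,T], \qquad (2)\\
u_i(x,0) &= f(x), && x \in \Omega_i,\\
u_i(z,t) &= g(z,t), && z \in \partial\Omega_i \cap \partial\Omega,\\
\mathcal{T}_{ij}(u_i(z,t)) &= \mathcal{T}_{ij}(u_j(z,t)), && z \in \partial\Omega_i \cap \partial\Omega_j,
\end{aligned}
$$

where $\mathcal{T}_{ij}$ are transmission operators appropriate to the equation (1). SWR decouples the system of PDEs in equation (2). Let $u_i^{[k]}$ denote the $k$th waveform iterate on sub-domain $\Omega_i$. After specifying an initial estimate for the sub-domain solution on the interfaces, $u_i^{[0]}(z,t)$, $z \in \partial\Omega_i \setminus \partial\Omega$, the SWR algorithm iteratively solves PDEs (3) for waveform iterates $k = 1, 2, \ldots$ until convergence,

$$
\begin{aligned}
(u_i^{[k]})_t &= \mathcal{L}(t,u_i^{[k]}), && (x,t) \in \Omega_i \times [0,T], \qquad (3)\\
u_i^{[k]}(x,0) &= f(x), && x \in \Omega_i,\\
u_i^{[k]}(z,t) &= g(z,t), && z \in \partial\Omega_i \cap \partial\Omega,\\
\mathcal{T}_{ij}(u_i^{[k]}(z,t)) &= \mathcal{T}_{ij}(u_j^{[k-1]}(z,t)), && z \in \partial\Omega_i \cap \partial\Omega_j.
\end{aligned}
$$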

A pseudo-code for the algorithm is presented below. Observe that SWR allows for each sub-domain to independently compute time-dependent solutions on their respective sub-domains (lines 9-11). During each waveform iteration, transmission data on each sub-domain is aggregated for the entire computational time interval before boundary data is exchanged between neighboring sub-domains (lines 12-14).
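To make the iteration concrete, the following is a minimal serial sketch of SWR for the 1D heat equation $u_t = u_{xx}$ on $[0,1]$ with two overlapping sub-domains. It is not the implementation used in this paper: it assumes classical Dirichlet transmission conditions (rather than general operators $\mathcal{T}_{ij}$), a backward Euler integrator, and hypothetical helper names (`solve_tridiag`, `be_step`, `monodomain`, `swr`). Each sub-domain sweeps the whole time window before interface traces are exchanged, mirroring the two inner loops of the pseudo-code.

```python
import math


def solve_tridiag(r, rhs):
    """Thomas algorithm for (1 + 2r) x[i] - r x[i-1] - r x[i+1] = rhs[i]."""
    n = len(rhs)
    diag = 1.0 + 2.0 * r
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = -r / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag + r * c[i - 1]
        c[i] = -r / denom
        d[i] = (rhs[i] + r * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def be_step(u, left, right, r):
    """One backward Euler step of u_t = u_xx with Dirichlet data left/right."""
    rhs = list(u[1:-1])
    rhs[0] += r * left
    rhs[-1] += r * right
    return [left] + solve_tridiag(r, rhs) + [right]


def monodomain(u0, r, nt):
    """Reference solution on the undecomposed domain."""
    u = list(u0)
    for _ in range(nt):
        u = be_step(u, 0.0, 0.0, r)
    return u


def swr(u0, a, b, r, nt, iters):
    """SWR on sub-domains of grid points 0..b and a..M, with a < b."""
    g1 = [0.0] * (nt + 1)  # guessed trace at x_b for sub-domain 1
    g2 = [0.0] * (nt + 1)  # guessed trace at x_a for sub-domain 2
    for _ in range(iters):
        u1, u2 = list(u0[:b + 1]), list(u0[a:])
        new_g1, new_g2 = list(g1), list(g2)
        for n in range(1, nt + 1):      # sweep the whole time window
            u1 = be_step(u1, 0.0, g1[n], r)
            u2 = be_step(u2, g2[n], 0.0, r)
            new_g2[n] = u1[a]           # trace of sub-domain 1 at x_a
            new_g1[n] = u2[b - a]       # trace of sub-domain 2 at x_b
        g1, g2 = new_g1, new_g2         # exchange only after the full sweep
    return u1, u2


M, nt, dt = 40, 20, 0.01
r = dt * M * M                          # dt / h^2 with h = 1 / M
u0 = [math.sin(math.pi * j / M) for j in range(M + 1)]
ref = monodomain(u0, r, nt)
u1, u2 = swr(u0, a=16, b=24, r=r, nt=nt, iters=60)
err = max(abs(u1[j] - ref[j]) for j in range(len(u1)))
```

In this sketch the two sub-domain sweeps inside one waveform iteration are independent of each other, which is the parallelism SWR exposes; the waveform iterations themselves remain sequential.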


3  Pipeline  Schwarz  Waveform  Relaxation 

Schwarz Waveform Relaxation Algorithm

 1. MPI Initialization
 2. parallel for $i = 1 \ldots N$ (Sub-domain)
 3.   for $t = \Delta t \ldots T$
 4.     Guess $u_i^{[0]}(z,t)$, $z \in \partial\Omega_i \cap \partial\Omega_j$
 5.   end
 6. end
 7. for $k = 1 \ldots K$ (Waveform iteration)
 8.   parallel for $i = 1 \ldots N$ (Sub-domain)
 9.     for $t = \Delta t \ldots T$
10.       Solve for $u_i^{[k]}(t,x)$
11.     end
12.     for $t = \Delta t \ldots T$
13.       Exchange transmission data $\mathcal{T}_{ij}(u_i^{[k]}(t,z))$
14.     end
15.     Check convergence
16.   end
17. end

Fig. 1 The proposed PSWR algorithm allows for multiple Schwarz waveform iterations to be simultaneously computed. After an initial start-up cost, multiple iterates are computed in a pipeline fashion.

Using an approach similar to that described in [3, 12], the relaxation framework can be rewritten so that after initial start-up costs, multiple waveform iterations can be computed in a pipeline-parallel fashion. A graphical example of the PSWR algorithm for two subdomains is shown in Figure 1. To simplify the presentation, we first present the algorithm for the simplified case where the same time discretization is used for all sub-problems (Pipeline Schwarz Waveform Relaxation Algorithm 1).

Pipeline Schwarz Waveform Relaxation Algorithm 1

 1. MPI Initialization
 2. parallel for $i = 1 \ldots N$ (Sub-domain)
 3.   for $t = \Delta t \ldots T$
 4.     Guess $u_i^{[0]}(z,t)$, $z \in \partial\Omega_i \cap \partial\Omega_j$
 5.   end
 6.   Set $t^{[0]} = T$
 7. end
 8. parallel for $k = 1 \ldots K$ (Waveform iteration)
 9.   parallel for $i = 1 \ldots N$ (Sub-domain)
10.     set $t^{[k]} = \Delta t$
11.     While $t^{[k]} < T$
12.       If $t^{[k]} < t^{[k-1]}$
13.         Solve for $u_i^{[k]}(t^{[k]},x)$
14.         Exchange transmission data
15.         $t^{[k]} = t^{[k]} + \Delta t$
16.       end
17.     end
18.     Check convergence
19.   end
20. end

Several observations should be made about the proposed PSWR algorithm. First, a Schwarz iteration can only proceed if boundary data (i.e. transmission conditions) from the previous iterate are available; this condition (part of the start-up cost before the PSWR algorithm can be run in a pipeline fashion) is checked by the if statement in line 12. Secondly, transmission data is exchanged after every time step to facilitate the pipeline parallelism. This added synchronization can be relaxed at the expense of increasing the start-up cost needed to run this algorithm in a pipeline fashion. This pipeline parallelism allows for $N \cdot K$ concurrent processes in the PSWR algorithm with efficiency K^N (accounting for start-up costs), where $N_t$ is the number of time steps used to discretize the time domain $[0,T]$, $N$ is the number of sub-domains, and $K$ is the number of waveform iterates. This contrasts with the SWR algorithm, which can only utilize $N$ concurrent processes corresponding to the $N$ sub-domains. This increased concurrency in PSWR comes with the overhead of an increased number of messages and synchronization.
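The start-up cost of the pipeline can be illustrated with a small idealized model. The sketch below is an assumption-laden toy, not the efficiency estimate stated in the text: every time step of every iterate is taken to cost one unit, communication is free, and an iterate may advance only when the previous iterate is strictly ahead of it. The function name `pipeline_ticks` is hypothetical.

```python
def pipeline_ticks(num_iterates, num_steps):
    """Count synchronous ticks until every iterate has finished all time steps."""
    # done[k] = time steps completed by iterate k; index 0 is the initial
    # guess, which is available for the whole time window from the start.
    done = [num_steps] + [0] * num_iterates
    ticks = 0
    while done[-1] < num_steps:
        # Iterate k can take a step only if iterate k-1 is already ahead.
        ready = [k for k in range(1, num_iterates + 1)
                 if done[k] < num_steps and done[k] < done[k - 1]]
        for k in ready:
            done[k] += 1
        ticks += 1
    return ticks


K, Nt = 16, 400
ticks = pipeline_ticks(K, Nt)
ideal_efficiency = (K * Nt) / (K * ticks)   # useful work / (workers * ticks)
```

Under these assumptions each additional iterate delays completion by a single tick, so the pipeline stays nearly full whenever the number of time steps is much larger than the number of iterates. Measured efficiencies are lower because communication is not free.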

For the SWR algorithm, one needs to send $O(K-1)$ messages of size $O(N_t)$. If $N \cdot K$ processors are used in a pipeline parallel fashion as described in Pipeline Schwarz Waveform Relaxation Algorithm 1, $O((K-1) \cdot N_t)$ messages of size $O(1)$ are needed. More generally, if $N \cdot p$ processors are used in the PSWR algorithm, where $p \le K$ is a divisor of $K$, then $O((p-1)/p \cdot K \cdot N_t)$ messages of size $O(1)$, and $O(K/p - 1)$ messages of size $O(N_t)$, are needed. We note that the PSWR algorithm can also be implemented using a framework that naturally reduces the number of messages in a system. Assuming a heterogeneous computing platform (where each socket has multiple cores), one can use the MPI-3 framework [11] or the OpenMP protocol in the outer “parallel for” statement in line 8, to aggregate transmission data from line 14 naturally before exchanging transmission data with neighboring nodes. Alternatively, because nodes working on waveform iterate $k$ only need to communicate with waveform iterates $k-1$, the PSWR algorithm allows for a natural grouping of nodes so that one can (in principle) use multiple overlapping communicators to leverage data/network-topology and software defined networking advances [5] to add scalability.
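The message-count estimates above can be tabulated with a short helper. This is a sketch only: it drops the constants hidden in the $O(\cdot)$ notation, assumes that $p$ divides $K$ so that $K/p$ is an integer, and uses the hypothetical name `message_counts`.

```python
def message_counts(K, Nt, p):
    """Leading-order message counts for K iterates, Nt time steps, pipeline width p.

    Returns (small, large): the number of O(1)-sized messages and the
    number of O(Nt)-sized messages, following the estimates in the text.
    """
    if p < 1 or K % p != 0:
        raise ValueError("p must be a positive divisor of K")
    small = (p - 1) * K * Nt // p   # exchanged after every time step
    large = K // p - 1              # exchanged after a full time window
    return small, large


swr_like = message_counts(16, 400, 1)     # p = 1: classical SWR
partial = message_counts(16, 400, 4)      # intermediate pipeline width
full_pipe = message_counts(16, 400, 16)   # p = K: fully pipelined
```

The two extremes reproduce the cases quoted in the text: $p = 1$ gives only $K - 1$ large messages, while $p = K$ gives $(K - 1) \cdot N_t$ small messages and no large ones.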
Generalizations to allow for disparate time discretizations in each sub-problem are possible. We list the algorithm without implementation. Unlike PSWR Algorithm 1, it is not possible to keep the “pipe” full, i.e. domain $i$ might need to wait for its neighbouring domains to provide boundary data. Additionally, solving for $u_i^{[k]}(t_i^{[k]},x)$ in line 14 requires an interpolation algorithm to obtain the correct transmission condition to be used in the solution of (3). Lastly, an implementation decision has to be made on how to collect and store the data from neighboring domains before the interpolation is used to obtain the transmission conditions for an update in line 14.


Pipeline Schwarz Waveform Relaxation Algorithm 2

 1. MPI Initialization
 2. parallel for $i = 1 \ldots N$ (Sub-domain)
 3.   for $t_i = \Delta t_i \ldots T$
 4.     Guess $u_i^{[0]}(z,t)$, $z \in \partial\Omega_i \cap \partial\Omega_j$
 5.   end
 6.   Set $t_i^{[0]} = T$
 7. end
 8. parallel for $k = 1 \ldots K$ (Waveform iteration)
 9.   parallel for $i = 1 \ldots N$ (Sub-domain)
10.     initialize $\Delta t_i^{[k]}$
11.     set $t_i^{[k]} = \Delta t_i^{[k]}$
12.     while $t_i^{[k]} < T$
13.       if $t_i^{[k]} < t_j^{[k-1]}$ for all neighbors $j$
14.         Solve for $u_i^{[k]}(t_i^{[k]},x)$
15.         Send transmission data $\mathcal{T}_{ij}(u_i^{[k]}(t_i^{[k]},z))$ to neighbor nodes
16.         $t_i^{[k]} = t_i^{[k]} + \Delta t_i^{[k]}$
17.       end
18.     end
19.     Check convergence
20.   end
21. end


4  Numerical  Experiments 

We present results from scaling studies, which vary the number of computational cores used to compute the PSWR algorithm while keeping total discretized problem size constant. The diffusion equation $u_t = k(u_{xx} + u_{yy})$ is solved in $\mathbb{R}^2$ using a centered five point finite-difference approximation in space, and a backward Euler time integrator. In our first scaling study, 400x400 grid points are decomposed into 4x4 non-overlapping domains for 400 total time steps. Optimized Robin transmission conditions of the form

$$\mathcal{T}_{ij}(u) = \left(\frac{\partial}{\partial n} + p\right) u$$

are used, where $\frac{\partial}{\partial n}$ is the derivative in the normal direction, and $p = 1$. (A recursive formula is used to compute the transmission condition in lieu of discretizing the derivative in the normal direction). In each experiment a total of 16 full waveform iterations are completed. Timing results are obtained using the Stampede supercomputer at the Texas Advanced Computing Center. Good parallel efficiency and speedup are observed in spite of the increase in the number of messages required by the PSWR algorithm. Note that the 4x4x1 case is identical to the SWR algorithm.

Fig. 2 The error of the waveform iterates at time $T$ is computed relative to the monodomain solution for a 4 x 4 decomposition of the problem using optimized transmission conditions. The convergence behavior of the PSWR algorithm is identical to the convergence behavior of the SWR algorithm.


case          # cores   walltime          speedup    efficiency
4 x 4 x 1     16        293.02 seconds    1.00 x     1.00
4 x 4 x 2     32        149.92 seconds    1.95 x     0.98
4 x 4 x 4     64        75.48 seconds     3.89 x     0.97
4 x 4 x 8     128       38.71 seconds     7.57 x     0.95
4 x 4 x 16    256       23.90 seconds     12.26 x    0.77

In our second scaling study, 1600x1600 grid points are decomposed into 16x16 non-overlapping domains for 400 total time steps. Again, a centered five point finite difference stencil, a backward Euler time integrator, and optimized transmission conditions are used. Good parallel efficiency and speedup are observed even with the increased synchronization/number of messages in the system.
case            # cores   walltime          speedup    efficiency
16 x 16 x 1     256       295.86 seconds    1.00 x     1.00
16 x 16 x 2     512       155.98 seconds    1.90 x     0.95
16 x 16 x 4     1024      77.10 seconds     3.84 x     0.96
16 x 16 x 8     2048      43.20 seconds     6.85 x     0.86
16 x 16 x 16    4096      26.65 seconds     11.10 x    0.69

In the above computations, a linear solve on a sub-domain takes $O(10^{-2})$ seconds. This relatively small problem size was chosen (100 x 100 on each sub-domain) so that communications would play a substantial role in the timing studies. The presented efficiencies can be improved by partitioning the problem to be more computationally expensive (i.e. more time is spent in the linear solve).
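For reference, the speedup and efficiency columns follow from the wall times in the usual strong-scaling way: speedup is the baseline wall time divided by the measured wall time, and efficiency is the speedup divided by the factor by which the core count grew. The sketch below recomputes both from the wall times of the second scaling study; the function name `scaling_table` is hypothetical.

```python
def scaling_table(cores, walltimes):
    """Return (cores, walltime, speedup, efficiency) rows relative to the first entry."""
    base_cores, base_time = cores[0], walltimes[0]
    rows = []
    for c, t in zip(cores, walltimes):
        speedup = base_time / t
        efficiency = speedup / (c / base_cores)
        rows.append((c, t, speedup, efficiency))
    return rows


# Wall times (seconds) reported for the 16 x 16 x p runs.
rows = scaling_table([256, 512, 1024, 2048, 4096],
                     [295.86, 155.98, 77.10, 43.20, 26.65])
for c, t, s, e in rows:
    print(f"{c:5d} cores  {t:7.2f} s  speedup {s:5.2f}  efficiency {e:4.2f}")
```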


5  Conclusions 


In this paper, we have reformulated classical Schwarz waveform relaxation to allow for pipeline-parallel computation of the waveform iterates, after an initial startup cost. Theoretical estimates for the parallel speedup and communication overhead are presented, along with scaling studies to show the effectiveness of the pipeline Schwarz waveform relaxation algorithm.
Revisionist integral deferred correction (RIDC) methods are a family of parallel-in-time methods to solve systems of initial value problems. The approach is able to bootstrap lower order time integrators to provide high order approximations in approximately the same wall clock time, hence providing a multiplicative increase in the number of compute cores utilized. Here we provide a library which automatically produces a parallel-in-time solution of a system of initial value problems given user supplied code for the right hand side of the system and a sequential code for a first-order time step. The user supplied time step routine may be explicit or implicit and may make use of any auxiliary libraries which take care of the solution of any nonlinear algebraic systems which may arise or the numerical linear algebra required.
1.  INTRODUCTION

The fast, accurate solution of an initial-value problem (IVP) of the form

$$y'(t) = f(t,y), \quad y(0) = y_0, \quad t \in [0,T], \qquad (1)$$

where $y(t) \in \mathbb{C}^N$, $f : \mathbb{R} \times \mathbb{C}^N \to \mathbb{C}^N$, is of practical interest in scientific computing. IVP (1) often arises from the spatial discretization of partial differential equations, and may require either an explicit or implicit time-integrator. The purpose of this software is to “wrap” a user-implemented first-order explicit or implicit solver for IVP (1) into a high-order parallel solver; that is, given $(t_n, y_n, f_n)$, a user specifies a function that returns $(t_{n+1}, y_{n+1}, f_{n+1})$ using either a forward Euler or backward Euler integrator. This work differs from existing ODE integration software or libraries, where a user typically only needs to specify the system of ODEs and relevant problem parameters. The upside is that our software provides a parallel-in-time solution while giving the user complete control of the first-order time step routine. For example, the user may choose their own quality libraries for the solution of systems of nonlinear algebraic equations or efficient linear system solvers particularly tuned to the structure of their problems.
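As an illustration of the kind of routine a user is asked to supply, the sketch below shows a forward Euler step that maps $(t_n, y_n, f_n)$ to $(t_{n+1}, y_{n+1}, f_{n+1})$. It is written in Python for brevity and does not reproduce the library's actual interface; the names `forward_euler_step` and `rhs`, the argument order, and the test problem $y' = -y$ are all assumptions made for this example.

```python
def forward_euler_step(f, t_n, y_n, f_n, dt):
    """Advance (t_n, y_n, f_n) by one explicit first-order step of size dt."""
    t_next = t_n + dt
    # y_{n+1} = y_n + dt * f(t_n, y_n), reusing the stored value f_n.
    y_next = [y + dt * fy for y, fy in zip(y_n, f_n)]
    # Evaluate the right-hand side once so the next step can reuse it.
    f_next = f(t_next, y_next)
    return t_next, y_next, f_next


def rhs(t, y):
    """Hypothetical test problem y' = -y, with exact solution exp(-t)."""
    return [-v for v in y]


t, y = 0.0, [1.0]
fy = rhs(t, y)
for _ in range(1000):
    t, y, fy = forward_euler_step(rhs, t, y, fy, 1e-3)
```

An implicit (backward Euler) variant would have the same inputs and outputs but would solve an algebraic system for $y_{n+1}$ inside the step, which is where a user-chosen nonlinear or linear solver would enter.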

There are three general approaches for a time-parallel solution of IVPs [Burrage 1993]. One approach is “parallelism-across-the-problem”, where a problem is decomposed into sub-problems that can be computed in parallel, and an iterative procedure is used to couple the sub-problems. Examples of this class of methods include parallel wave-form relaxation methods [Vandewalle and Roose 1989]. The second approach is “parallel-across-the-step” methods, where the time domain is partitioned into smaller temporal subdomains which are solved simultaneously. Examples of this class of methods include parareal methods [Maday and Turinici 2002; Gander and Vandewalle 2007], where the method alternates between applying a coarse sequential solver and a fine parallel solver. The third approach is “parallelism-across-the-method”, where one exploits concurrent function evaluations within a step to generate a parallel time integrator. This approach typically allows for small-scale parallelism, constrained by the number of function evaluations that can be evaluated in parallel. This is often related to the order of the approximation. Examples of Runge-Kutta methods where stages can be evaluated in parallel include [Miranker and Liniger 1967; Enenkel and Jackson 1997; Ketcheson and bin Waheed 2014]. Alternatively, one can use a predictor-corrector framework to generate parallel-across-the-method time integrators. This includes parallel extrapolation methods [Kappeller et al. 1996], and RIDC integrators [Christlieb et al. 2010; Christlieb and Ong 2011], which are the focus of this paper. A survey of parallel time integration methods has recently appeared [Gander 2015].


1.1.  Related  Software 

There are several well established software packages for solving differential algebraic equations; however, not many of them are able to solve IVPs (1) in parallel. For sequential integrators, probably the most well known are the MATLAB routines ode45, ode23, and ode15s [Shampine et al. 1999] for solving systems of differential equations. These schemes use embedded RK pairs or numerical differentiation formulas (of the specified order) to approximate solutions to the differential equations using adaptive time-stepping. Readers might also be familiar with DASSL [Petzold 1983], which implements backward differentiation formulas of order one through five. The nonlinear system at each time-step is solved by Newton’s method, and the resulting linear systems are solved using routines from LINPACK. DASSL leverages the SLATEC Common Mathematical Library [Vandevender and Haskell 1982] for step-size adaptivity. Also popular are ODEPACK [Hindmarsh 1983] and VODE [Brown et al. 1989], a collection of Fortran solvers for IVPs, and SUNDIALS, a suite of robust time integrators and nonlinear solvers [Hindmarsh et al. 2005]. There are also a variety of ODE and DAE time steppers implemented in PETSc [Balay et al. 2014] and GSL [Gough 2009].

The selection of parallel solvers for IVPs is fairly sparse. EPPEER [Schmitt 2013] is a Fortran95/OpenMP implementation of explicit parallel two-step peer methods [Weiner et al. 2008] for the solution of ODEs on multicore architectures. PyPFASST [Emmett 2013] is a Python implementation of a modified parareal solver for ODEs and PDEs [Emmett and Minion 2012]. XBRAID [Schroder et al. 2015] is a C library that implements a multigrid-reduction-in-time algorithm [Falgout et al. 2014], where multiple time-grids of different granularity are distributed across processors using MPI. PFASST++ [Emmett et al. 2015] is a C++ implementation of the “parallel full approximation scheme in space and time” (PFASST) algorithm [Emmett and Minion 2014]. There are other implementations (such as the dependency-driven parareal framework developed at Oak Ridge National Laboratory [Elwasif et al. 2011]) that do not appear to be available for download at present.
2. REVIEW OF RIDC METHODS

Spectral  deferred  correction  (SDC)  [Dutt  et  al.  2000]  provides  an  iterative  correction  of 
an  approximate  solution  by  solving  an  integral  formulation  of  an  error  equation.  This 
integral  form  stabilizes  the  classical  differential  deferred  correction  approach.  RIDC 
is  a  re-formulation  of  SDC,  pipelining  successive  calculations  so  that  corrections  can 
be  obtained  in  parallel  with  an  appropriate  time  lag.  SDC,  in  contrast,  is  a  sequential 
algorithm.  Unlike  the  spectral  deferred  correction,  which  uses  Gauss-Lobatto  nodes, 
RIDC  uses  uniformly  spaced  nodes  to  minimize  the  memory  footprint  and  to  allow  one 
to  embed  high  order  integrators  [Christlieb  et  al.  2009;  2010]. 

The  basic  idea  of  the  IDC  and  RIDC  approaches  is  to  formulate  associated  error  IVPs 
which  correct  numerical  errors  from  the  solutions  to  IVP  (1);  the  parallelism  arises 
from  the  ability  to  simultaneously  compute  solutions  to  both  IVP  (1)  and  solutions 
to  the  associated  error  IVPs.  In  this  section,  we  review  the  formulation  of  the  error 
equations,  discretizations,  and  parallel  properties  of  the  RIDC  algorithm.  Please  refer 
to  [Christlieb  et  al.  2010;  Christlieb  and  Ong  2011]  for  accuracy  and  stability  properties 
of  the  RIDC  approach. 

2.1.  Error  IVPs 

Denote the (unknown) exact solution of IVP (1) as $y(t)$, and the approximate solution
as $u(t)$, with $u(0) = y(0)$. The error in the approximate solution is $e(t) = y(t) - u(t)$.
Define the residual (sometimes known as the defect) as $r(t) = u'(t) - f(t, u)$. Then, the
time derivative of the error satisfies

$$e'(t) = y'(t) - u'(t) = f(t, u + e) - f(t, u) - r(t).$$

Since $e(0) = y(0) - u(0) = 0$, we have just derived the associated error IVP. For stability,
the integral form of the error IVP is preferred [Dutt et al. 2000],

$$\left(e + \int_0^t r(\tau)\,d\tau\right)' = f(t, u + e) - f(t, u). \tag{2}$$

Observing that the corrected approximation $u + e$ is still an approximation if the
error equation (2) is solved numerically, we adopt a more general notation which will
allow us to iteratively correct the solution until a desired accuracy is reached. Denote
the initial approximation as $u^{[0]}$, the $p$th approximation as $u^{[p]}$, and the error to
$u^{[p]}$ as $e^{[p]}$. Then, the error equation can be rewritten as

$$\left(e^{[p]} + \int_0^t r^{[p]}(\tau)\,d\tau\right)' = f(t, u^{[p]} + e^{[p]}) - f(t, u^{[p]}), \tag{3}$$

where $r^{[p]}(t) = (u^{[p]})'(t) - f(t, u^{[p]})$.

2.2.  Discretization 

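With some algebra, a first-order explicit discretization of (3), written in terms of the
solution, gives

$$u^{[p+1]}_{n+1} = u^{[p+1]}_n + \Delta t\, f(t_n, u^{[p+1]}_n) - \Delta t\, f(t_n, u^{[p]}_n) + \int_{t_n}^{t_{n+1}} f(\tau, u^{[p]})\,d\tau. \tag{4}$$

Likewise a first-order implicit discretization of (3) gives

$$u^{[p+1]}_{n+1} = u^{[p+1]}_n + \Delta t\, f(t_{n+1}, u^{[p+1]}_{n+1}) - \Delta t\, f(t_{n+1}, u^{[p]}_{n+1}) + \int_{t_n}^{t_{n+1}} f(\tau, u^{[p]})\,d\tau. \tag{5}$$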
In both semi-discretizations (4) and (5), a sufficiently accurate quadrature is needed
to approximate the integrals present [Dutt et al. 2000]. If a first order predictor was
applied to obtain an approximate solution to (1), and first order correctors such as (4)
and (5) are used, approximating the quadrature using

$$\int_{t_n}^{t_{n+1}} f(\tau, u^{[p]})\,d\tau \approx
\begin{cases}
\sum_{\nu=0}^{p+1} \alpha_{p\nu}\, f(t_{n+1-\nu}, u^{[p]}_{n+1-\nu}), & \text{if } n \ge p, \\
\sum_{\nu=0}^{p+1} \alpha_{p\nu}\, f(t_{\nu}, u^{[p]}_{\nu}), & \text{if } n < p,
\end{cases}$$

where $\alpha_{p\nu}$ are quadrature weights,

$$\alpha_{p\nu} =
\begin{cases}
\int_{t_n}^{t_{n+1}} \prod_{i=0,\, i\ne\nu}^{p+1} \frac{(t - t_{n+1-i})}{(t_{n+1-\nu} - t_{n+1-i})}\,dt, & \text{if } n \ge p, \\
\int_{t_n}^{t_{n+1}} \prod_{i=0,\, i\ne\nu}^{p+1} \frac{(t - t_i)}{(t_\nu - t_i)}\,dt, & \text{if } n < p,
\end{cases}$$

results in a $P$th order method, if $(P - 1)$ such corrections are applied.

2.3.  Stability 

A study of the (linear) stability of explicit RIDC methods is provided in [Christlieb
et al. 2010] and for implicit RIDC methods in [Christlieb and Ong 2011]. The results
indicate that the region of absolute stability of RIDC methods approaches the region of
absolute stability of the underlying predictor as the number of time steps increases.
Moreover, the implicit RIDC4-BE method preserves the A-stability property of
backward Euler.


2.4.  Parallelization 

As mentioned earlier, the parallelism arises from the ability to simultaneously compute
solutions to both IVP (1) and the associated error IVPs (3). This is possible if there is
some staggering to decouple solutions of IVP (1) and the error equations. As shown in
Figure 1, staggering of one timestep is required to compute solutions in a pipeline
parallel fashion. For example, while the predictor computes a solution at time $t_{10}$,
the first corrector computes the correction at time $t_9$, the second corrector computes
the second correction at time $t_8$, and so on. We discuss the “memory footprint” and
the startup routine required by the RIDC method in Section 2.5 before presenting
pseudocode for the RIDC methods in Algorithm 2.

2.5.  Memory  Footprint,  Efficiency,  Start-up  and  Shut-down 

Figure 1 also shows the “memory footprint” required to execute the RIDC method in a
pipeline-parallel fashion. The memory footprint consists of copies of the solution vector
evaluated at earlier correction/prediction levels and time steps; one can also think of the
memory footprint as the discretization stencil across the different correction and
prediction levels. For a $P$th order RIDC method, the $(P-1)$st correction update (i.e. solving
error IVP #$(P-1)$) requires a stencil of size $(P + 1)$, the $(P-2)$nd correction requires an
additional $(P-2)$ size stencil, the $(P-3)$rd correction requires an additional $(P-3)$ size
stencil, and so on. The total memory footprint required for a $P$th order RIDC method is

$$\frac{(P-1)P}{2} + (P-1) + 1 = \frac{P(P+1)}{2}.$$

Fig. 1. In a RIDC method, solution values and correction terms are computed in a
pipeline fashion. For example, while a processor is computing a solution to IVP (1) at
$t_{10}$, a second processor computes corrections to the numerical error at time $t_9$, a third
processor computes additional corrections at time $t_8$, and so forth (i.e., the open circles
denote solutions that are simultaneously computed). The solid circles denote stored
solution values that are needed for the quadrature approximation.

In [Christlieb et al. 2010] it is shown that the ratio of time steps taken by a $P$th-order
RIDC-Euler method, using $K$ steps before a restart, to the number of steps taken by
the forward Euler method is

$$\gamma = 1 + \frac{(P-1)^2}{K}.$$

This  shows  that  the  method  becomes  more  efficient  (in  terms  of  wall-clock  time)  as  K 
increases.  One  does  have  to  balance  a  large  value  of  K  with  the  possible  increase  in 
error  this  may  cause.  A  study  of  this  balance  is  provided  in  [Christlieb  et  al.  2010]. 
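To make the trade-off concrete, the short Python sketch below simply evaluates the ratio $\gamma = 1 + (P-1)^2/K$ quoted above for a few restart lengths. The function name `step_ratio` and the sample values of $K$ are chosen here for illustration.

```python
def step_ratio(order, steps_before_restart):
    """Ratio of RIDC-Euler steps to forward Euler steps: 1 + (P - 1)^2 / K."""
    return 1.0 + (order - 1) ** 2 / steps_before_restart

# Fourth-order RIDC: the overhead shrinks as the restart length K grows.
ratios = {K: step_ratio(4, K) for K in (9, 90, 900)}
```

For $P = 4$ the ratio falls from 2 at $K = 9$ towards 1 as $K$ increases, which is the sense in which larger $K$ is more efficient.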

Because  of  the  staggering,  start-up  steps  are  needed  to  fill  the  memory  footprint.  As 
discussed  in  [Christlieb  et  al.  2010],  one  should  control  the  start-up  steps  to  minimize 
the  size  of  the  memory  footprint;  that  is,  it  is  more  desirable  to  stall  the  predictors  and 
lower-level  correctors  initially  (as  appropriate)  until  all  predictors  and  correctors  can 
be  marched  in  a  pipeline  fashion  with  the  minimal  memory  footprint.  For  example, 
Figure 2 shows the start-up routine for a fourth-order RIDC method. Initially, only the
predictor advances the solution from $t_0$ to $t_1$ in step one. In steps two and three, both
the predictor and first corrector are advanced to populate the memory stencil in
preparation for the second corrector. In step four, only the second corrector is advanced; the
predictor and first corrector are stalled because the memory stencil needed to advance
the second corrector from $t_1$ to $t_2$ is the same memory stencil needed to advance the
corrector from $t_0$ to $t_1$.

Although this concept is easy to grasp, the startup algorithm looks non-intuitive at
first glance. Algorithm 1 specifies the nuts-and-bolts of the start-up routine. The RIDC
method can be run in a pipeline fashion (with the minimal memory footprint) after
startnum - 1 initialization steps, where startnum = $\max(1, \frac{p(p+1)}{2} - 1)$. For example,
no initialization is required if $p = 1$. If $p = 4$, eight initialization steps are required:
the RIDC method starts marching in a pipeline fashion at step nine. In the RIDC
software, this startup routine is implemented using the filter variable.
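The bookkeeping in this subsection can be checked with a few lines of Python. This is a sketch under stated assumptions: the memory footprint is taken to be $P(P+1)/2$ as given above, and the outer operation in the startnum formula is taken to be a maximum, which is what the two cases quoted in the text ($p = 1$ and $p = 4$) require. The function names are hypothetical.

```python
def memory_footprint(order):
    """Number of stored solution copies for a P-th order RIDC method: P(P+1)/2."""
    return order * (order + 1) // 2

def startnum(order):
    """First step at which all levels march in a pipeline (assumed max form)."""
    return max(1, order * (order + 1) // 2 - 1)

# Initialization steps (startnum - 1) for orders one through five.
init_steps = {p: startnum(p) - 1 for p in range(1, 6)}
```

For $p = 4$ this gives eight initialization steps and pipelined marching from step nine, matching the example in the text.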

The shut-down routine for the RIDC method is straightforward; each predictor and
corrector only advances the solution until the final time, $t_F$, is reached. The parallel
RIDC pseudo-code is summarized in Algorithm 2.
Fig. 2. Start-up routine for a fourth-order RIDC method. Observe that the predictor and
lower order correctors are occasionally stalled to ensure that a minimal memory footprint
is used for the RIDC method. The fourth-order RIDC startup takes eight steps; from step
nine on, the RIDC method can be marched in a pipeline fashion.


ALGORITHM 1: RIDC Startup-Routine

startnum = max(1, P(P+1)/2 - 1);
for p = 1 to (P - 1) do
    march previous levels (i.e. 0, ..., (p - 1)) in a pipe for one step;
    march current level (p - 1) steps;
    march levels 0, ..., p in a pipe for one step;
end


ALGORITHM 2: RIDC Pseudo Code

fill memory stencil, compute startnum;
for nt = startnum to NT do
    for p = 0 to (P - 1) do in parallel
        if p = 0 then
            use step to advance solution on prediction level
            (if tF not reached on prediction level);
        else
            use corr_fe or corr_be to advance solution on correction level p
            (if tF not reached on correction level p);
        end
    end
    update memory stencil;
end
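For readers who want to see the arithmetic of Algorithm 2 without the pipelining, the following Python sketch applies the explicit correction (4) level by level for a scalar ODE. It is a sequential emulation under simplifying assumptions, not the RIDC software: every level uses the same stencil of `order` uniformly spaced nodes, no restarts are performed, and all function names are hypothetical.

```python
import numpy as np

def lagrange_weights(nodes, a, b):
    """Integrate each Lagrange basis polynomial on `nodes` over [a, b]."""
    weights = np.empty(len(nodes))
    for v, t_v in enumerate(nodes):
        others = np.delete(nodes, v)
        basis = np.poly1d(others, r=True) / np.prod(t_v - others)
        anti = basis.integ()
        weights[v] = anti(b) - anti(a)
    return weights

def ridc_forward_euler(f, y0, t_final, nsteps, order):
    """Sequential emulation of explicit RIDC for a scalar ODE y' = f(t, y)."""
    assert nsteps >= order - 1, "need at least order-1 steps to fill the stencil"
    dt = t_final / nsteps
    t = np.linspace(0.0, t_final, nsteps + 1)
    u = np.empty(nsteps + 1)
    u[0] = y0
    for n in range(nsteps):                      # level 0: forward Euler predictor
        u[n + 1] = u[n] + dt * f(t[n], u[n])
    for level in range(1, order):                # levels 1..order-1: correctors
        f_old = f(t, u)                          # f evaluated on the previous level
        new = np.empty_like(u)
        new[0] = y0
        for n in range(nsteps):
            # Trailing stencil once enough history exists, else the first `order` nodes.
            start = max(0, n + 1 - (order - 1))
            idx = np.arange(start, start + order)
            integral = lagrange_weights(t[idx], t[n], t[n + 1]) @ f_old[idx]
            # Equation (4): Euler step on the new level, minus the old Euler
            # increment, plus the quadrature of f on the old level.
            new[n + 1] = new[n] + dt * (f(t[n], new[n]) - f_old[n]) + integral
        u = new
    return t, u

def final_error(nsteps, order):
    """Error at t = 1 for the test problem y' = -y, y(0) = 1."""
    _, u = ridc_forward_euler(lambda t, y: -y, 1.0, 1.0, nsteps, order)
    return abs(u[-1] - np.exp(-1.0))
```

In the parallel algorithm each level would run on its own thread with a one-step lag; the values computed are the same, only the order of evaluation differs.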


3.  RIDC  SOFTWARE 

To utilize popular sequential integrators as described in Section 1.1, a user often
specifies $f(t, y)$, the range of integration $[0, T]$, the initial condition $y_0 = y(0)$ (and for
DASSL, the derivative $y'_0 = y'(0)$), and integrator parameters (such as parameters
for controlling step-size adaptivity). While these general purpose time integration
routines are convenient and easy to use, this “black-box” approach (for example, a user
does not have to deal with the nonlinear solves arising from the backward
differentiation formulas) sometimes precludes the use of additional information, such as the
use of a problem-specific preconditioner, sparsity of the matrices, or multigrid iterative
solvers.

The RIDC software presented here differs from the type of time-integration software
mentioned above in that a first-order, user-specified, advance for $t \to t + \Delta t$ is
bootstrapped to generate a high-order, parallel integrator using the integral deferred
correction framework described in Section 2.
3.1. Under the hood

The RIDC software and examples are coded in C++; task parallelism is achieved using
OpenMP threads to solve the predictors and the correctors in parallel. This mode of
parallelism was chosen to accommodate the data movement/communication required
by the RIDC algorithm when solving equations (4) and (5). We assume that the
user-defined step routine to advance the solution is a first-order sequential integrator,
although with some minor modifications to the RIDC software provided, bootstrapping
higher order integrators is possible. The RIDC software can also be modified to
leverage a thread-safe user-defined step routine, for example a CUDA-accelerated step
routine [Ong et al. 2012] or an MPI-parallelized step routine [Christlieb et al. 2012];
see Section 3.3. If the step routine uses an explicit Euler integrator, the
RIDC software assumes that $u_{n+1}$ satisfies

$$u_{n+1} - u_n = \Delta t\, f(t_n, u_n).$$

If the step routine uses an implicit Euler integrator, the RIDC software assumes that
$u_{n+1}$ satisfies

$$u_{n+1} - u_n = \Delta t\, f(t_{n+1}, u_{n+1}).$$

The RIDC software treats this step routine as a black box, as depicted in Figure 3.
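The two step-routine contracts above can be sketched in Python as follows. These are illustrative stand-ins, not the interface of the RIDC software (which is C++): the function names and argument order are hypothetical, and the implicit routine is written only for a linear system $u' = Mu$ so that the solve is a single call to a linear solver.

```python
import numpy as np

def step_forward_euler(f, t_n, u_n, dt):
    """Explicit Euler: returns u_{n+1} with u_{n+1} - u_n = dt * f(t_n, u_n)."""
    return u_n + dt * f(t_n, u_n)

def make_step_backward_euler(matrix):
    """Implicit Euler for the linear system u' = matrix @ u."""
    identity = np.eye(matrix.shape[0])

    def step(t_n, u_n, dt):
        # u_{n+1} - u_n = dt * matrix @ u_{n+1}  <=>  (I - dt * matrix) u_{n+1} = u_n
        return np.linalg.solve(identity - dt * matrix, u_n)

    return step

# Hypothetical stiff test matrix and one step of each integrator.
M = np.array([[-100.0, 1.0], [0.0, -1.0]])
u0 = np.array([1.0, 1.0])
dt = 0.05
u_explicit = step_forward_euler(lambda t, u: M @ u, 0.0, u0, dt)
u_implicit = make_step_backward_euler(M)(0.0, u0, dt)
```

Either routine maps $(\Delta t, t_n, u_n)$ to $u_{n+1}$, which is all the wrapper needs to know about it.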


Fig. 3. User-defined step routine that advances a solution from $t_n$ to $t_{n+1}$:
the inputs are $(\Delta t, t_n, u_n)$ and the outputs are $(t_{n+1}, u_{n+1})$.


The RIDC functions solve equations (4) and (5) by creating the necessary data
structures to store copies of the solution vector described in Section 2.5, and then
performing the appropriate algebraic computations on these stored solution values. First,
consider the explicit Euler discretization of the error equation (4). Observe that
$u^{[p+1]}_{n+1}$ can be constructed by applying the user-defined step routine to
$u^{[p+1]}_n$ to obtain $\tilde{u}^{[p+1]}_{n+1}$, and then adding
$-\Delta t\, f(t_n, u^{[p]}_n) + \int_{t_n}^{t_{n+1}} f(\tau, u^{[p]})\,d\tau$ to
$\tilde{u}^{[p+1]}_{n+1}$ to finally obtain $u^{[p+1]}_{n+1}$. The
explicit RIDC wrapper is displayed in Figure 4.
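The explicit wrapper described above amounts to "step, then post-process". A minimal Python sketch follows, assuming the quadrature value has already been computed and is passed in as `integral`; the name `corr_fe` echoes the routine mentioned in Algorithm 2, but the signature shown here is hypothetical and is not the library's.

```python
def step_forward_euler(f, t_n, u_n, dt):
    """User-supplied black box: one explicit Euler step."""
    return u_n + dt * f(t_n, u_n)

def corr_fe(step, f, t_n, dt, u_new_n, u_old_n, integral):
    """Advance the corrected level from t_n to t_{n+1} using equation (4)."""
    u_tilde = step(f, t_n, u_new_n, dt)              # black-box step on the new level
    return u_tilde - dt * f(t_n, u_old_n) + integral  # post-process

# Hypothetical data for the scalar problem u' = -2u.
f = lambda t, u: -2.0 * u
t_n, dt = 0.3, 0.1
u_new_n, u_old_n, integral = 0.55, 0.50, -0.09
u_new_np1 = corr_fe(step_forward_euler, f, t_n, dt, u_new_n, u_old_n, integral)
```

The wrapper never inspects how `step` works, so the same post-process applies to any explicit first-order step routine that satisfies the Euler contract.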


Fig. 4. A visualization of the RIDC wrapper to obtain a solution to equation (4). The
post process takes an input $\tilde{u}^{[p+1]}_{n+1}$ and returns
$\tilde{u}^{[p+1]}_{n+1} - \Delta t\, f(t_n, u^{[p]}_n) + \int_{t_n}^{t_{n+1}} f(\tau, u^{[p]})\,d\tau$.


A similar observation can be made about the implicit Euler discretization of the
error equation (5); however, one first constructs the intermediate value
$\tilde{u}^{[p+1]}_n = u^{[p+1]}_n - \Delta t\, f(t_{n+1}, u^{[p]}_{n+1}) + \int_{t_n}^{t_{n+1}} f(\tau, u^{[p]})\,d\tau$,
and then applies the user-defined step function to $\tilde{u}^{[p+1]}_n$. The implicit
RIDC wrapper is displayed in Figure 5.
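The implicit wrapper reverses the order: "pre-process, then step". The Python sketch below assumes, consistently with equation (5), that the subtracted term is $f$ evaluated on the previous level at $t_{n+1}$, and that both this value and the quadrature are supplied by the caller. The name `corr_be` echoes Algorithm 2; the signature and the scalar linear test problem are hypothetical.

```python
def make_step_backward_euler(lam):
    """User-supplied black box: one implicit Euler step for u' = lam * u."""
    def step(t_n, u_n, dt):
        # Solves v - u_n = dt * lam * v for v = u_{n+1}.
        return u_n / (1.0 - dt * lam)
    return step

def corr_be(step, t_n, dt, u_new_n, f_old_np1, integral):
    """Advance the corrected level from t_n to t_{n+1} using equation (5)."""
    u_tilde = u_new_n - dt * f_old_np1 + integral   # pre-process
    return step(t_n, u_tilde, dt)                   # black-box implicit step

# Hypothetical data for the scalar problem u' = lam * u.
lam, t_n, dt = -50.0, 0.3, 0.1
u_new_n, u_old_np1, integral = 0.8, 0.7, -3.2
v = corr_be(make_step_backward_euler(lam), t_n, dt, u_new_n, lam * u_old_np1, integral)
```

Because the step routine solves $v - \tilde{u} = \Delta t\, f(t_{n+1}, v)$, the returned value satisfies (5) without the wrapper performing any implicit solve of its own.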
Fig. 5. A visualization of the RIDC wrapper to obtain a solution to equation (5).
The pre-process takes an input $u^{[p+1]}_n$ and returns
$u^{[p+1]}_n - \Delta t\, f(t_{n+1}, u^{[p]}_{n+1}) + \int_{t_n}^{t_{n+1}} f(\tau, u^{[p]})\,d\tau$.


3.2.  Discussion 

The  computational  overhead  of  RIDC  methods  resides  mainly  in  the  quadrature  ap¬ 
proximation,  and  the  subsequent  linear  combinations  used  to  compute  the  corrected 
solutions.  Provided  this  computational  overhead  is  small  compared  to  an  evaluation  of 
the  step  routine,  good  parallel  speedup  is  achieved.  In  practice,  this  is  almost  always 
the  case  for  implicit  RIDC  methods  where  solutions  to  linear  equations,  and/or  New¬ 
ton  iterations  are  required.  For  explicit  RIDC  methods,  good  parallel  speedup  is  only 
observed  when  the  step  routine  is  sufficiently  expensive,  such  as  in  the  computation  of 
self-consistent  forces  for  an  n-body  problem  [Christlieb  et  al.  2010]. 

As mentioned in Section 2.5, the RIDC method has to store copies of the solution vector
evaluated at earlier correction/prediction levels. Although this memory requirement
might appear restrictive, the memory footprint for high order single-step, multi-step
or general linear methods is similar. Implicit RIDC methods also benefit from the
loose coupling between the prediction and correction equations: whereas a general
implicit s-stage RK method necessitates the solution of a system of sN (potentially
nonlinear) equations, where N is the number of differential algebraic equations,
a pth-order RIDC method constructed using backward Euler integrators requires the
solution of p decoupled systems of N (potentially nonlinear) algebraic equations.

3.3.  Possible  Generalizations 

For  clarity,  only  the  simplest  variant  of  the  RIDC  method  (constructed  using  first  order 
Euler  integrators,  uniform  time-stepping,  serial  computation  of  the  step  routine)  has 
been  presented,  and  released  as  part  of  the  base  software  version.  Here,  we  make 
some  remarks  on  how  the  base  version  of  the  software  can  be  modified  by  the  user  to 
accommodate several generalizations discussed in this section; indeed, the authors will
release  (when  possible)  modified  versions  of  the  software  within  the  source  repository 
that  illustrate  how  to  generate  generalized  RIDC  integrators. 

Step-size adaptivity for error control. In [Christlieb et al. 2015], several variants of
adaptive RIDC methods were presented. In the simplest variant, one uses standard
error control strategies to adaptively select step-sizes while solving IVP (1). These
adaptively selected step-sizes are used for solving the error equations (2). To build step-size
adaptivity into the provided RIDC software, the following modifications will be needed:
(i) modify the time-loop appropriately to allow for non-uniform steps, (ii) modify the
driver file appropriately to take a user-defined tolerance (as opposed to the number of
time steps), (iii) recompute the integration matrix containing the quadrature weights
at every time step. The user will presumably provide an additional adapt_step
function, which takes as inputs the solution at time $t$, the previous time step used,
$\Delta t_{old}$, and a tolerance tol, and returns the time step selected, $\Delta t$, and the
solution at the new time, $t + \Delta t$.

Restarts: As discussed in [Christlieb et al. 2010], the RIDC method accumulates
error while running in a pipeline fashion: the most accurate solution does not propagate
to the earlier prediction/correction levels. In some cases, it might be advantageous to
stop the RIDC method, and use the most accurate solution to “restart” the
computation. This requires only a simple modification to the main RIDC loop in ridc.cpp.

Constructing RIDC methods using higher-order integrators: With a few
modifications, it is possible to use higher-order single step integrators within the RIDC
software. The memory stencil, integration matrix and quadrature approximations will
need to be modified in ridc.cpp.

Semi-implicit RIDC methods: Although semi-implicit RIDC methods have been
constructed and studied in [Ong et al. 2012], it is in general not possible to wrap a
user-defined semi-implicit step function to solve the error equation (2). Consider the IVP

$$y'(t) = f_N(t, y) + f_S(t, y),$$

where $f_S$ contains stiff terms and $f_N$ contains the nonstiff terms. A first-order
user-defined step function to solve this IVP would look like

$$u_{n+1} - u_n = \Delta t\, f_N(t_n, u_n) + \Delta t\, f_S(t_{n+1}, u_{n+1}),$$

whereas the first-order IMEX discretization of the error equation (2) is

$$u^{[p+1]}_{n+1} = u^{[p+1]}_n
+ \Delta t \left[ f_N(t_n, u^{[p+1]}_n) + f_S(t_{n+1}, u^{[p+1]}_{n+1}) \right]
- \Delta t \left[ f_N(t_n, u^{[p]}_n) + f_S(t_{n+1}, u^{[p]}_{n+1}) \right]
+ \int_{t_n}^{t_{n+1}} \left[ f_N(\tau, u^{[p]}) + f_S(\tau, u^{[p]}) \right] d\tau.$$

Although it is not obvious how to automatically bootstrap a semi-implicit step
function, a user can leverage the data structures and quadrature approximations in
ridc.cpp to construct a new corr_fbe function, which should look similar in structure
to the user's step function.

Using accelerators for the step routine: Many computing clusters feature nodes with
multiple accelerators, e.g. NVIDIA GPGPUs or Intel Xeon Phis. If the user wishes to
provide a step routine that is accelerated using these emerging architectures, the RIDC
code can be modified to leverage multiple accelerators in a computational node.
Modifications that are required include: adding an input variable “level” (an integer from 0 to
p - 1, where p is the desired order / number of accelerators in the system) into the step
routine, a function call within the step function to specify the appropriate accelerator,
e.g. cudaSetDevice for the NVIDIA GPGPUs, and a modification of ridc.cpp so that
the prediction/correction level is fed into the step function, ensuring that the linear
algebra is performed on the appropriate accelerator.

Using  distributed  MPI  for  the  step  routine:  Although  the  RIDC  software  can  be  mod¬ 
ified  to  allow  for  an  MPI-distributed  step  routine  (provided  this  step-routine  is  thread 
safe),  we  showed  in  [Haynes  and  Ong  2014]  that  a  tighter  coupling  of  the  hybrid  MPI- 
OpenMP  formulation  to  reduce  the  number  of  messages  is  necessary  for  performance. 


4.  NUMERICAL  EXPERIMENT 

The  software  includes  several  examples  verifying  that  the  RIDC  methods  attain  their 
designed  orders  of  accuracy.  As  mentioned,  these  examples  also  serve  as  templates  for 
the  user  to  bootstrap  their  own  first  order  time  integration  methods  to  give  a  parallel- 
in-time  approximation.  Good  parallel  speedup  is  observed  when  the  computational 
overhead  for  the  RIDC  methods  (namely,  the  quadrature  approximation  and  the  linear 
combinations  to  compute  the  corrected  solutions)  is  small  compared  to  an  evaluation 
of the step routine. Here, we present the numerical results for the Brusselator in $\mathbb{R}^1$,

$$u_t = A + u^2 v - (B + 1)u + \alpha u_{xx}, \tag{6}$$
$$v_t = Bu - u^2 v + \alpha v_{xx},$$

with $A = 1$, $B = 3$ and $\alpha = 0.02$, initial conditions

$$u(0, x) = 1 + \sin(2\pi x), \quad v(0, x) = 3,$$

and boundary conditions

$$u(t, 0) = u(t, 1) = 1, \quad v(t, 0) = v(t, 1) = 3.$$

A  central  finite  difference  approximation  is  used  to  discretize  equation  (6).  The  re¬ 
sulting  nonlinear  system  of  equations  is  solved  using  a  Newton  iteration.  In  the  tim¬ 
ing  results,  the  Intel  Math  Kernel  Library  (MKL)  is  used  to  solve  the  linear  sys¬ 
tem  arising  in  each  Newton  iteration.  The  code  for  this  example  can  be  found  in  the 
examples/brusselator_mkl  directory.  Figure  6  shows  a  standard  convergence  study  of 
error  versus  number  of  timesteps  to  demonstrate  that  the  RIDC  software  bootstraps 
the  first  order  integrator  to  generate  a  high-order  method  of  the  desired  accuracy. 
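For orientation, the spatial discretization described above can be sketched in Python. Only the central finite difference right-hand side is shown; the Newton iteration and the MKL linear solves used in the timing runs are not reproduced. The function name, the grid size and the use of interior unknowns with the boundary values appended are assumptions made for this illustration.

```python
import numpy as np

def brusselator_rhs(u, v, dx, A=1.0, B=3.0, alpha=0.02):
    """Right-hand side of (6) at interior grid points, Dirichlet data u = 1, v = 3."""
    u_full = np.concatenate(([1.0], u, [1.0]))   # append boundary values of u
    v_full = np.concatenate(([3.0], v, [3.0]))   # append boundary values of v
    u_xx = (u_full[2:] - 2.0 * u_full[1:-1] + u_full[:-2]) / dx**2
    v_xx = (v_full[2:] - 2.0 * v_full[1:-1] + v_full[:-2]) / dx**2
    du = A + u**2 * v - (B + 1.0) * u + alpha * u_xx
    dv = B * u - u**2 * v + alpha * v_xx
    return du, dv

# Hypothetical grid with 99 interior points on (0, 1).
x = np.linspace(0.0, 1.0, 101)[1:-1]
dx = 0.01
u_init = 1.0 + np.sin(2.0 * np.pi * x)           # u(0, x)
v_init = np.full_like(x, 3.0)                    # v(0, x)
du0, dv0 = brusselator_rhs(u_init, v_init, dx)

# The constant state u = A, v = B / A is a steady state consistent with the boundary data.
du_ss, dv_ss = brusselator_rhs(np.ones_like(x), np.full_like(x, 3.0), dx)
```

A backward Euler step routine would solve $w_{n+1} - w_n = \Delta t\, F(w_{n+1})$ with $F$ built from this function, which is where the Newton iteration enters.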


Fig. 6. Standard convergence study of error versus time step, $\Delta t$, showing that RIDC
methods achieve their designed orders of accuracy.


In Figure 7, the walltime used to compute each RIDC method is plotted to show
the “weak scaling” capability of RIDC methods. For example, the fourth-order RIDC
method computes a solution using four computing cores that is 3-5 orders of magnitude
more accurate than the first order Euler solution in approximately the same wallclock
time. Timing results using a serial three-stage, fifth-order Radau IIA integrator are also
presented. A fifth order RIDC method (with five computing cores) provides a solution
with comparable accuracy in 10% of the walltime. The scaling studies were performed
on a single computational node consisting of a dual socket Intel E5-2670v2 chipset.

5.  CONCLUSIONS 

In this paper, we presented the revisionist integral deferred correction (RIDC)
software for solving systems of initial value problems. The approach bootstraps lower
order time integrators to provide high order approximations in approximately the same
wall clock time, at the cost of a multiplicative increase in the number of compute cores
utilized. The C++ framework produces a parallel-in-time solution of a system of initial
value problems given user supplied code for the right hand side of the system and the
sequential code for a first-order time step. The user supplied time step routine may be
explicit or implicit and may make use of any auxiliary libraries which take care of the
solution of the nonlinear algebraic systems which arise or the numerical linear algebra
required.

Fig. 7. The error as a function of walltime is plotted for various RIDC methods (plot legend: Euler,
RIDC-2, RIDC-3, RIDC-4, RIDC-5, Radau IIA (5)). Here, two computing cores
(set via OMP_NUM_THREADS=2) are used to compute the second order RIDC method (RIDC-2), three computing
cores are used to compute RIDC-3, four computing cores are used to compute RIDC-4, and five computing
cores are used to compute RIDC-5. A single computing core was used to compute Radau IIA. The RIDC
software computes a pth order solution in approximately the same wall clock time as an Euler solution,
provided p computing cores are available. The parallel RIDC methods also provide good speedup over a
serial Radau IIA integrator.
1 Introduction

In this paper we consider the phase retrieval for sparse signals with noisy
measurements, which arises in many different applications. Assume that

$$b_j := |\langle a_j, x_0 \rangle| + e_j, \quad j = 1, \ldots, m,$$

where $x_0 \in \mathbb{R}^N$, $a_j \in \mathbb{R}^N$ and $e_j \in \mathbb{R}$ is the noise. Our goal is to recover $x_0$ up to
a unimodular scaling constant from $b := (b_1, \ldots, b_m)^T$ with the assumption of $x_0$
being approximately $k$-sparse. This problem is referred to as the compressive phase
retrieval problem [9].
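The measurement model can be simulated in a few lines of Python. This is only an illustration of the setup: the problem sizes, the noise level and the $1/\sqrt{m}$ scaling of the Gaussian matrix are assumptions chosen here, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, k = 50, 200, 5                      # hypothetical signal length, measurements, sparsity

# A k-sparse real signal x0.
x0 = np.zeros(N)
support = rng.choice(N, size=k, replace=False)
x0[support] = rng.standard_normal(k)

# Real Gaussian measurement vectors a_j as the rows of A (scaling is an assumption).
A = rng.standard_normal((m, N)) / np.sqrt(m)
e = 0.01 * rng.standard_normal(m)         # noise

# Phaseless noisy measurements b_j = |<a_j, x0>| + e_j.
b = np.abs(A @ x0) + e

# The sign of x0 cannot be determined from b: -x0 yields the same measurements.
b_flipped = np.abs(A @ (-x0)) + e
```

In the real case the "unimodular scaling constant" is just this sign ambiguity, which is why recovery is only asked for up to $\pm x_0$.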

The paper attempts to address two problems. Firstly we consider the stability of
$\ell_1$ minimization for the compressive phase retrieval problem where the signal $x_0$ is
approximately $k$-sparse, which is the $\ell_1$ minimization problem defined as follows:

$$\min \|x\|_1 \quad \text{subject to} \quad \big\|\, |Ax| - |Ax_0| \,\big\|_2 \le \epsilon, \tag{1.1}$$

where $A := [a_1, \ldots, a_m]^T$ and $|Ax_0| := [|\langle a_1, x_0 \rangle|, \ldots, |\langle a_m, x_0 \rangle|]^T$. Secondly we
investigate instance-optimality in the phase retrieval setting.

Note that in the classical compressive sensing setting the stable recovery of a
$k$-sparse signal $x_0 \in \mathbb{C}^N$ can be done using $m = \mathcal{O}(k \log(N/k))$ measurements for
several classes of measurement matrices $A$. A natural question is whether stable
compressive phase retrieval can also be attained with $m = \mathcal{O}(k \log(N/k))$ measurements.
This has indeed been proved to be the case in [6] if $x_0 \in \mathbb{R}^N$ and $A$ is a random real Gaussian
matrix. In [8] a two-stage algorithm for compressive phase retrieval is proposed, which
allows for very fast recovery of a sparse signal if the matrix $A$ can be written as a
product of a random matrix and another matrix (such as a random matrix) that allows for
efficient phase retrieval. The authors proved that stable compressive phase retrieval
can be achieved with $m = \mathcal{O}(k \log(N/k))$ measurements for complex signals $x_0$ as
well. In [10], the strong RIP (S-RIP) property is introduced and the authors show that
one can use $\ell_1$ minimization to recover sparse signals up to a global sign from the
noiseless measurements $|Ax_0|$ provided $A$ satisfies S-RIP. Naturally, one is interested
in the performance of $\ell_1$ minimization for compressive phase retrieval with noisy
measurements. In this paper, we shall show that the $\ell_1$ minimization scheme given in
(1.1) will recover a $k$-sparse signal stably from $m = \mathcal{O}(k \log(N/k))$ measurements,
provided that the measurement matrix $A$ satisfies the strong RIP (S-RIP) property.
This establishes an important parallel for compressive phase retrieval with the
classical compressive sensing. Note that in [11] such a parallel in terms of the null space
property was already established.

The notion of instance optimality was first introduced in [5]. We use $\|x\|_0$ to denote
the number of non-zero elements in $x$. Given a norm $\|\cdot\|_X$ such as the $\ell_1$-norm and
$x \in \mathbb{R}^N$, the best $k$-term approximation error is defined as

$$\sigma_k(x)_X := \min_{z \in \Sigma_k} \|x - z\|_X,$$

where

$$\Sigma_k := \{x \in \mathbb{R}^N : \|x\|_0 \le k\}.$$
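For an $\ell_p$ norm with $p \ge 1$, the minimum in the definition of the best $k$-term approximation error is attained by keeping the $k$ largest-magnitude entries of $x$, so the error is the norm of the remaining entries. The Python sketch below uses this standard fact; the function name and the sample vector are chosen here for illustration.

```python
import numpy as np

def best_k_term_error(x, k, p=1):
    """sigma_k(x) in the l_p norm: the l_p norm of all but the k largest-magnitude entries."""
    magnitudes = np.sort(np.abs(np.asarray(x, dtype=float)))   # ascending order
    tail = magnitudes[: max(len(magnitudes) - k, 0)]           # entries that are dropped
    return float((tail ** p).sum() ** (1.0 / p))

x = [3.0, -1.0, 0.5, 0.0, 2.0]
sigma_2 = best_k_term_error(x, 2)    # keeps 3 and 2, drops 1, 0.5 and 0
sigma_5 = best_k_term_error(x, 5)    # x is itself 5-sparse
```

A vector is exactly $k$-sparse precisely when this error is zero, and "approximately $k$-sparse" means the error is small.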

We use $\Delta : \mathbb{R}^m \to \mathbb{R}^N$ to denote a decoder for reconstructing $x$. We say the pair
$(A, \Delta)$ is instance optimal of order $k$ with constant $C_0$ if

$$\|x - \Delta(Ax)\|_X \le C_0\, \sigma_k(x)_X \tag{1.2}$$

holds for all $x \in \mathbb{R}^N$. In extending it to phase retrieval, our decoder will have the
input $b = |Ax|$. A pair $(A, \Delta)$ is said to be phaseless instance optimal of order $k$ with
constant $C_0$ if

$$\min\{\|x - \Delta(|Ax|)\|_X,\ \|x + \Delta(|Ax|)\|_X\} \le C_0\, \sigma_k(x)_X \tag{1.3}$$

holds for all $x \in \mathbb{R}^N$. We are interested in the following problem: Given $\|\cdot\|_X$ and
$k < N$, what is the minimal value of $m$ for which there exists $(A, \Delta)$ so that (1.3)
holds?

The null space $\mathcal{N}(A) := \{x \in \mathbb{R}^N : Ax = 0\}$ of $A$ plays an important role in the
analysis of the original instance optimality (1.2) (see [5]). Here we present a null space
property for $\mathcal{N}(A)$ which is necessary and sufficient for the existence of a decoder
$\Delta$ such that (1.3) holds. We apply the result to investigate the instance optimality where
$X$ is the $\ell_1$ norm. Set

$$\Delta_1(|Ax|) := \operatorname*{argmin}_{z \in \mathbb{R}^N} \{\|z\|_1 : |Ax| = |Az|\}.$$

We show that the pair $(A, \Delta_1)$ satisfies (1.3) with $X$ being the $\ell_1$-norm provided $A$
satisfies the strong RIP property (see Definition 2.1). As shown in [10], the Gaussian
random matrix $A \in \mathbb{R}^{m \times N}$ satisfies the strong RIP of order $k$ for $m = \mathcal{O}(k \log(N/k))$.
Hence $m = \mathcal{O}(k \log(N/k))$ measurements suffice to ensure the phaseless instance
optimality (1.3) for the $\ell_1$-norm exactly as with the traditional instance optimality
(1.2).

2  Auxiliary  Results 

In this section we provide some auxiliary results that will be used in later sections.
For $x \in \mathbb{R}^N$ we use $\|x\|_p := \|x\|_{\ell_p}$ to denote the $p$-norm of $x$ for $0 < p \le \infty$. The
measurement matrix is given by $A := [a_1, \ldots, a_m]^T \in \mathbb{R}^{m \times N}$ as before. Given an
index set $I \subseteq \{1, \ldots, m\}$ we shall use $A_I$ to denote the sub-matrix of $A$ where only
rows with indices in $I$ are kept, i.e.,

$$A_I := [a_j : j \in I]^T.$$
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The  matrix  A  satisfies  the  Restricted  Isometry  Property  (RIP)  of  order  k  if  there  exists 
a  constant  §k  e  [0,  1)  such  that  for  all  ^-sparse  vectors  z  e  X/,-  we  have 

(1  -  5jfe)]|z||i  <  ]|Az|||  <  (1+  5jfc)l|z]li- 


It  was  shown  in  [2]  that  one  can  use  l\ -minimization  to  recover  ^-sparse  signals 
provided  that  A  satisfies  the  RIP  of  order  tk  and  8tk  <  ^1  —  \  where  t  >  1. 

To investigate compressive phase retrieval, a stronger notion of RIP is given in [10]:

Definition 2.1 (S-RIP) We say the matrix $A = [a_1, \ldots, a_m]^T \in \mathbb{R}^{m \times N}$ has the Strong Restricted Isometry Property of order $k$ with bounds $\theta_-, \theta_+ \in (0, 2)$ if

$$\theta_- \|x\|_2^2 \le \min_{I \subseteq [m],\, |I| \ge m/2} \|A_I x\|_2^2 \le \max_{I \subseteq [m],\, |I| \ge m/2} \|A_I x\|_2^2 \le \theta_+ \|x\|_2^2 \tag{2.1}$$

holds for all $k$-sparse signals $x \in \mathbb{R}^N$, where $[m] := \{1, \ldots, m\}$. We say $A$ has the Strong Lower Restricted Isometry Property of order $k$ with bound $\theta_-$ if the lower bound in (2.1) holds. Similarly we say $A$ has the Strong Upper Restricted Isometry Property of order $k$ with bound $\theta_+$ if the upper bound in (2.1) holds.
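
For a fixed vector $x$ the two extrema in (2.1) are easy to evaluate: the maximum is attained at $I = [m]$, and the minimum over $|I| \ge m/2$ is attained by keeping the $\lceil m/2 \rceil$ rows with the smallest values of $\langle a_j, x\rangle^2$. The following Python sketch computes both ratios for one $x$. The helper name `srip_ratios` and the toy matrix are hypothetical; the sketch illustrates the definition only and does not certify the S-RIP, which requires bounds over all $k$-sparse $x$.

```python
import math


def srip_ratios(A, x):
    """Return (min_I ||A_I x||^2, max_I ||A_I x||^2) / ||x||^2 over |I| >= m/2.

    A is a list of rows a_j; x is a nonzero vector of matching length.
    """
    # Squared inner products <a_j, x>^2, sorted in increasing order.
    sq = sorted(sum(a * v for a, v in zip(row, x)) ** 2 for row in A)
    half = math.ceil(len(A) / 2)          # smallest admissible |I|
    norm_sq = sum(v * v for v in x)
    lower = sum(sq[:half]) / norm_sq      # the `half` smallest terms
    upper = sum(sq) / norm_sq             # all rows, I = [m]
    return lower, upper


A = [[1.0], [2.0], [3.0], [4.0]]
print(srip_ratios(A, [1.0]))  # (5.0, 30.0): 1 + 4 and 1 + 4 + 9 + 16
```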

The authors of [10] proved that Gaussian matrices with $m = O(tk \log(N/k))$ satisfy S-RIP of order $tk$ with high probability.

Theorem 2.1 ([10]) Suppose that $t > 1$ and $A = (a_{ij}) \in \mathbb{R}^{m \times N}$ is a random Gaussian matrix with $m = O(tk \log(N/k))$ and $a_{ij} \sim \mathcal{N}(0, \tfrac{1}{\sqrt{m}})$. Then there exist $\theta_-, \theta_+ \in (0, 2)$ such that with probability $1 - \exp(-cm/2)$ the matrix $A$ satisfies the S-RIP of order $tk$ with constants $\theta_-$ and $\theta_+$, where $c > 0$ is an absolute constant and $\theta_-, \theta_+$ are independent of $t$.

The  following  is  a  very  useful  lemma  for  this  study. 

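Lemma 2.1 Let $x_0 \in \mathbb{R}^N$ and $\rho \ge 0$. Suppose that $A \in \mathbb{R}^{m \times N}$ is a measurement matrix satisfying the restricted isometry property with $\delta_{tk} < \sqrt{\frac{t-1}{t}}$ for some $t > 1$. Then for any

$$\hat{x} \in \{x \in \mathbb{R}^N : \|x\|_1 \le \|x_0\|_1 + \rho,\ \|Ax - Ax_0\|_2 \le \epsilon\}$$

we have

$$\|\hat{x} - x_0\|_2 \le c_1 \epsilon + c_2 \frac{2\sigma_k(x_0)_1}{\sqrt{k}} + c_2 \frac{\rho}{\sqrt{k}},$$

where

$$c_1 = \frac{\sqrt{2(1+\delta)}}{1 - \sqrt{t/(t-1)}\,\delta}, \qquad c_2 = \frac{\sqrt{2}\,\delta + \sqrt{(\sqrt{t(t-1)} - \delta t)\,\delta}}{\sqrt{t(t-1)} - \delta t} + 1,$$

with $\delta := \delta_{tk}$.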


Birkhauser 


151 

DISTRIBUTION  A:  Distribution  approved  for  public  release. 


J  Fourier  Anal  Appl 


Remark 2.1 We build the proof of Lemma 2.1 following the ideas of Cai and Zhang [2]. The full proof is given in the Appendix for completeness. It is well known that an effective method to recover approximately sparse signals $x_0$ in traditional compressive sensing is to solve

$$x^{\#} := \operatorname*{argmin}_x \{\|x\|_1 : \|Ax - Ax_0\|_2 \le \epsilon\}. \tag{2.2}$$

The definition of $x^{\#}$ shows that

$$\|x^{\#}\|_1 \le \|x_0\|_1, \qquad \|Ax^{\#} - Ax_0\|_2 \le \epsilon,$$

which implies that

$$\|x^{\#} - x_0\|_2 \le C_1 \epsilon + C_2 \frac{\sigma_k(x_0)_1}{\sqrt{k}},$$

provided that $A$ satisfies the RIP condition with $\delta_{tk} < \sqrt{1 - 1/t}$ for $t > 1$ (see [2]). However, in practice one prefers to design fast algorithms to find an approximate solution of (2.2), say $\hat{x}$. Thus it is possible to have $\|\hat{x}\|_1 > \|x_0\|_1$. Lemma 2.1 gives an estimate of $\|\hat{x} - x_0\|_2$ for the case where $\|\hat{x}\|_1 \le \|x_0\|_1 + \rho$.

Remark 2.2 In [7], Han and Xu extend the definition of S-RIP by replacing the $m/2$ in (2.1) by $\beta m$ where $0 < \beta < 1$. They also prove that, for any fixed $\beta \in (0, 1)$, the $m \times N$ random Gaussian matrix satisfies S-RIP of order $k$ with high probability provided $m = O(k \log(N/k))$.


3  Stable  Recovery  of  Real  Phase  Retrieval  Problem 

3.1  Stability  Results 

The following lemma shows that the map $\phi_A(x) := |Ax|$ is stable on $\Sigma_k$ modulo a unimodular constant provided $A$ satisfies the strong lower RIP of order $2k$. Define the equivalence relation $\sim$ on $\mathbb{R}^N$ and $\mathbb{C}^N$ by the following: for any $x, y$, $x \sim y$ iff $x = cy$ for some unimodular scalar $c$, where $x, y$ are in $\mathbb{R}^N$ or $\mathbb{C}^N$. For any subset $Y$ of $\mathbb{R}^N$ or $\mathbb{C}^N$ the notation $Y/\!\sim$ denotes the equivalence classes of elements in $Y$ under the equivalence. Note that there is a natural metric $D_\sim$ on $\mathbb{C}^N/\!\sim$ given by

$$D_\sim(x, y) = \min_{|c| = 1} \|x - cy\|.$$

Our primary focus in this paper will be on $\mathbb{R}^N$, and in this case $D_\sim(x, y) = \min\{\|x - y\|_2, \|x + y\|_2\}$.
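
In the real case the metric $D_\sim$ only has to compare $x$ with $y$ and with $-y$. A minimal Python sketch of this distance (the name `sign_distance` is ours):

```python
import math


def sign_distance(x, y):
    """min(||x - y||_2, ||x + y||_2): distance between x and y up to a global sign."""
    minus = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    plus = math.sqrt(sum((a + b) ** 2 for a, b in zip(x, y)))
    return min(minus, plus)


print(sign_distance([1.0, -2.0], [-1.0, 2.0]))  # 0.0, since y = -x
```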

Lemma 3.1 Let $A \in \mathbb{R}^{m \times N}$ satisfy the strong lower RIP of order $2k$ with constant $\theta_-$. Then for any $x, y \in \Sigma_k$ we have

$$\||Ax| - |Ay|\|_2^2 \ge \theta_- \min(\|x - y\|_2^2, \|x + y\|_2^2).$$

Proof For any $x, y \in \Sigma_k$ we divide $\{1, \ldots, m\}$ into two subsets:

$$T = \{j : \operatorname{sign}(\langle a_j, x\rangle) = \operatorname{sign}(\langle a_j, y\rangle)\}$$

and

$$T^c = \{j : \operatorname{sign}(\langle a_j, x\rangle) = -\operatorname{sign}(\langle a_j, y\rangle)\}.$$

Clearly one of $T$ and $T^c$ will have cardinality at least $m/2$. Without loss of generality we assume that $T$ has cardinality no less than $m/2$. Since $x - y$ is $2k$-sparse, the strong lower RIP of order $2k$ gives

$$\begin{aligned}
\||Ax| - |Ay|\|_2^2 &= \|A_T x - A_T y\|_2^2 + \|A_{T^c} x + A_{T^c} y\|_2^2 \\
&\ge \|A_T x - A_T y\|_2^2 \\
&\ge \theta_- \|x - y\|_2^2 \\
&\ge \theta_- \min(\|x - y\|_2^2, \|x + y\|_2^2).
\end{aligned}$$


□ 
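
The first equality in the proof of Lemma 3.1 can be checked numerically: for rows where the signs of $u = \langle a_j, x\rangle$ and $v = \langle a_j, y\rangle$ agree, $\big||u| - |v|\big| = |u - v|$, and for the remaining rows it equals $|u + v|$. The sketch below compares both sides on random data. The function names and the Gaussian test instance are our own choices for illustration.

```python
import random


def matvec(A, x):
    """Multiply the matrix A (list of rows) by the vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def sign_split_sides(A, x, y):
    """Return (|| |Ax| - |Ay| ||_2^2, ||A_T(x-y)||_2^2 + ||A_{T^c}(x+y)||_2^2)."""
    Ax, Ay = matvec(A, x), matvec(A, y)
    lhs = sum((abs(u) - abs(v)) ** 2 for u, v in zip(Ax, Ay))
    rhs = 0.0
    for u, v in zip(Ax, Ay):
        if u * v >= 0:           # j in T: signs agree (zeros counted here)
            rhs += (u - v) ** 2
        else:                    # j in T^c: signs are opposite
            rhs += (u + v) ** 2
    return lhs, rhs


rng = random.Random(0)
m, N = 8, 5
A = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(m)]
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
y = [rng.gauss(0.0, 1.0) for _ in range(N)]
lhs, rhs = sign_split_sides(A, x, y)
print(abs(lhs - rhs) < 1e-9)  # True
```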

Remark 3.1 Note that the combination of Lemma 3.1 and Theorem 2.1 shows that for an $m \times N$ Gaussian matrix $A$ with $m = O(k \log(N/k))$ one can guarantee the stability of the map $\phi_A(x) := |Ax|$ on $\Sigma_k/\!\sim$.


3.2  The  Main  Theorem 

In this part, we will consider how many measurements are needed for stable sparse phase retrieval by $\ell_1$-minimization via solving the following model:

$$\min \|x\|_1 \quad \text{subject to} \quad \||Ax| - |Ax_0|\|_2 \le \epsilon, \tag{3.1}$$

where $A$ is our measurement matrix and $x_0 \in \mathbb{R}^N$ is a signal we wish to recover. The next theorem tells under what conditions the solution to (3.1) is stable.

Theorem 3.1 Assume that $A \in \mathbb{R}^{m \times N}$ satisfies the S-RIP of order $tk$ with bounds $\theta_-, \theta_+ \in (0, 2)$ such that

$$t > \max\left\{\frac{1}{2\theta_- - \theta_-^2},\ \frac{1}{2\theta_+ - \theta_+^2}\right\}.$$

Then any solution $\hat{x}$ for (3.1) satisfies

$$\min\{\|\hat{x} - x_0\|_2, \|\hat{x} + x_0\|_2\} \le c_1 \epsilon + c_2 \frac{2\sigma_k(x_0)_1}{\sqrt{k}},$$

where $c_1$ and $c_2$ are constants defined in Lemma 2.1.

Proof Clearly any $\hat{x} \in \mathbb{R}^N$ solving (3.1) must have

$$\|\hat{x}\|_1 \le \|x_0\|_1 \tag{3.2}$$

and

$$\||A\hat{x}| - |Ax_0|\|_2^2 \le \epsilon^2. \tag{3.3}$$

Now the index set $\{1, 2, \ldots, m\}$ is divisible into two subsets

$$T = \{j : \operatorname{sign}(\langle a_j, \hat{x}\rangle) = \operatorname{sign}(\langle a_j, x_0\rangle)\},$$

$$T^c = \{j : \operatorname{sign}(\langle a_j, \hat{x}\rangle) = -\operatorname{sign}(\langle a_j, x_0\rangle)\}.$$

Then (3.3) implies that

$$\|A_T \hat{x} - A_T x_0\|_2^2 + \|A_{T^c} \hat{x} + A_{T^c} x_0\|_2^2 \le \epsilon^2. \tag{3.4}$$

Here either $|T| \ge m/2$ or $|T^c| \ge m/2$. Without loss of generality we assume that $|T| \ge m/2$. We use the fact

$$\|A_T \hat{x} - A_T x_0\|_2^2 \le \epsilon^2. \tag{3.5}$$

From (3.2) and (3.5) we obtain

$$\hat{x} \in \{x \in \mathbb{R}^N : \|x\|_1 \le \|x_0\|_1,\ \|A_T x - A_T x_0\|_2 \le \epsilon\}. \tag{3.6}$$

Recall that $A$ satisfies S-RIP of order $tk$ and constants $\theta_-, \theta_+$. Here

$$t > \max\left\{\frac{1}{2\theta_- - \theta_-^2},\ \frac{1}{2\theta_+ - \theta_+^2}\right\} \ge 1. \tag{3.7}$$

The definition of S-RIP implies that $A_T$ satisfies the RIP of order $tk$ in which

$$\delta_{tk} \le \max\{1 - \theta_-,\ \theta_+ - 1\} < \sqrt{\frac{t-1}{t}}, \tag{3.8}$$

where the second inequality follows from (3.7). The combination of (3.6), (3.8) and Lemma 2.1 now implies

$$\|\hat{x} - x_0\|_2 \le c_1 \epsilon + c_2 \frac{2\sigma_k(x_0)_1}{\sqrt{k}},$$

where $c_1$ and $c_2$ are defined in Lemma 2.1. If $|T^c| \ge m/2$ we get the corresponding result

$$\|\hat{x} + x_0\|_2 \le c_1 \epsilon + c_2 \frac{2\sigma_k(x_0)_1}{\sqrt{k}}.$$

The theorem is now proved. □

This theorem demonstrates that, if the measurement matrix has the S-RIP, the real compressive phase retrieval problem can be solved stably by $\ell_1$-minimization.
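
To get a feel for these hypotheses, the sketch below evaluates the threshold $\max\{1/(2\theta_- - \theta_-^2),\ 1/(2\theta_+ - \theta_+^2)\}$ from Theorem 3.1 and the constants $c_1 = \sqrt{2(1+\delta)}/(1 - \sqrt{t/(t-1)}\,\delta)$ and $c_2 = (\sqrt{2}\,\delta + \sqrt{t(\sqrt{(t-1)/t} - \delta)\delta})/(t(\sqrt{(t-1)/t} - \delta)) + 1$ of Lemma 2.1. The function names and sample values are hypothetical, and the formulas are our transcription of the statements, so treat this as an illustration rather than a reference implementation.

```python
import math


def t_threshold(theta_minus, theta_plus):
    """max{1/(2*theta - theta^2)} over the two S-RIP bounds in (0, 2)."""
    return max(1.0 / (2 * th - th * th) for th in (theta_minus, theta_plus))


def lemma21_constants(delta, t):
    """Constants (c1, c2) of Lemma 2.1; needs t > 1 and 0 <= delta < sqrt((t-1)/t)."""
    if t <= 1 or delta < 0:
        raise ValueError("need t > 1 and delta >= 0")
    gap = math.sqrt((t - 1.0) / t) - delta
    if gap <= 0:
        raise ValueError("need delta < sqrt((t - 1) / t)")
    c1 = math.sqrt(2 * (1 + delta)) / (1 - math.sqrt(t / (t - 1.0)) * delta)
    c2 = (math.sqrt(2) * delta + math.sqrt(t * gap * delta)) / (t * gap) + 1
    return c1, c2


theta_minus, theta_plus = 0.5, 1.5
delta = max(1 - theta_minus, theta_plus - 1)      # RIP constant of A_T
print(t_threshold(theta_minus, theta_plus))       # about 4/3
print(lemma21_constants(delta, t=2.0))            # finite, since 0.5 < sqrt(1/2)
```

Both constants blow up as $\delta$ approaches $\sqrt{(t-1)/t}$, which is why the sample call uses $t = 2$ rather than a value close to the threshold.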

4  Phase  Retrieval  and  Best  k-term  Approximation 

4.1  Instance  Optimality  from  the  Linear  Measurements 

We introduce some definitions and results from [5]. Recall that for a given encoder matrix $A \in \mathbb{R}^{m \times N}$ and a decoder $\Delta : \mathbb{R}^m \to \mathbb{R}^N$, the pair $(A, \Delta)$ is said to have instance optimality of order $k$ with constant $C_0$ with respect to the norm $X$ if

$$\|x - \Delta(Ax)\|_X \le C_0\,\sigma_k(x)_X \tag{4.1}$$

holds for all $x \in \mathbb{R}^N$. Set $\mathcal{N}(A) := \{\eta \in \mathbb{R}^N : A\eta = 0\}$ to be the null space of $A$. The following theorem gives conditions under which (4.1) holds.

Theorem 4.1 ([5]) Let $A \in \mathbb{R}^{m \times N}$, $1 \le k \le N$ and $\|\cdot\|_X$ be a norm on $\mathbb{R}^N$. Then a sufficient condition for the existence of a decoder $\Delta$ satisfying (4.1) is

$$\|\eta\|_X \le \frac{C_0}{2}\,\sigma_{2k}(\eta)_X, \quad \forall \eta \in \mathcal{N}(A). \tag{4.2}$$

A necessary condition for the existence of a decoder $\Delta$ satisfying (4.1) is

$$\|\eta\|_X \le C_0\,\sigma_{2k}(\eta)_X, \quad \forall \eta \in \mathcal{N}(A). \tag{4.3}$$

For the norm $X = \ell_1$ it was established in [5] that instance optimality of order $k$ can indeed be achieved, e.g. for a Gaussian matrix $A$, with $m = O(k \log(N/k))$. The authors also considered more generally taking different norms on both sides of (4.1). Following [5], we say the pair $(A, \Delta)$ has $(p, q)$-instance optimality of order $k$ with constant $C_0$ if

$$\|x - \Delta(Ax)\|_p \le C_0\, k^{\frac{1}{p} - \frac{1}{q}}\,\sigma_k(x)_q, \quad \forall x \in \mathbb{R}^N, \tag{4.4}$$

with $1 \le q \le p \le 2$. It was shown in [5] that the $(p, q)$-instance optimality of order $k$ can be achieved at the cost of having $m = O(k(N/k)^{2 - 2/q} \log(N/k))$ measurements.

4.2  Phaseless  Instance  Optimality 

A  natural  question  here  is  whether  an  analogous  result  to  Theorem  4.1  exists  for 
phaseless  instance  optimality  defined  in  (1.3).  We  answer  the  question  by  presenting 
such  a  result  in  the  case  of  real  phase  retrieval. 

Recall that a pair $(A, \Delta)$ is said to have the phaseless instance optimality of order $k$ with constant $C_0$ for the norm $\|\cdot\|_X$ if

$$\min\{\|x - \Delta(|Ax|)\|_X,\ \|x + \Delta(|Ax|)\|_X\} \le C_0\,\sigma_k(x)_X \tag{4.5}$$

holds for all $x \in \mathbb{R}^N$.

Theorem 4.2 Let $A \in \mathbb{R}^{m \times N}$, $1 \le k \le N$ and $\|\cdot\|_X$ be a norm. Then a sufficient condition for the existence of a decoder $\Delta$ satisfying the phaseless instance optimality (4.5) is: For any $I \subseteq \{1, \ldots, m\}$ and $\eta_1 \in \mathcal{N}(A_I)$, $\eta_2 \in \mathcal{N}(A_{I^c})$ we have

$$\min\{\|\eta_1\|_X, \|\eta_2\|_X\} \le \frac{C_0}{4}\sigma_k(\eta_1 - \eta_2)_X + \frac{C_0}{4}\sigma_k(\eta_1 + \eta_2)_X. \tag{4.6}$$

A necessary condition for the existence of a decoder $\Delta$ satisfying (4.5) is: For any $I \subseteq \{1, \ldots, m\}$ and $\eta_1 \in \mathcal{N}(A_I)$, $\eta_2 \in \mathcal{N}(A_{I^c})$ we have

$$\min\{\|\eta_1\|_X, \|\eta_2\|_X\} \le \frac{C_0}{2}\sigma_k(\eta_1 - \eta_2)_X + \frac{C_0}{2}\sigma_k(\eta_1 + \eta_2)_X. \tag{4.7}$$

Proof We first assume (4.6) holds, and show that there exists a decoder $\Delta$ satisfying the phaseless instance optimality (4.5). To this end, we define a decoder $\Delta$ as follows:

$$\Delta(|Ax_0|) = \operatorname*{argmin}_{|Ax| = |Ax_0|} \sigma_k(x)_X.$$

Suppose $\hat{x} := \Delta(|Ax_0|)$. We have $|A\hat{x}| = |Ax_0|$ and $\sigma_k(\hat{x})_X \le \sigma_k(x_0)_X$. Note that $\langle a_j, \hat{x}\rangle = \pm\langle a_j, x_0\rangle$. Let $I \subseteq \{1, \ldots, m\}$ be defined by

$$I = \{j : \langle a_j, \hat{x}\rangle = \langle a_j, x_0\rangle\}.$$

Then

$$A_I(x_0 - \hat{x}) = 0, \qquad A_{I^c}(x_0 + \hat{x}) = 0.$$

Set

$$\eta_1 := x_0 - \hat{x} \in \mathcal{N}(A_I), \qquad \eta_2 := x_0 + \hat{x} \in \mathcal{N}(A_{I^c}).$$

A simple observation yields

$$\sigma_k(\eta_1 - \eta_2)_X = 2\sigma_k(\hat{x})_X \le 2\sigma_k(x_0)_X, \qquad \sigma_k(\eta_1 + \eta_2)_X = 2\sigma_k(x_0)_X. \tag{4.8}$$

Then (4.6) implies that

$$\begin{aligned}
\min\{\|\hat{x} - x_0\|_X, \|\hat{x} + x_0\|_X\} &= \min\{\|\eta_1\|_X, \|\eta_2\|_X\} \\
&\le \frac{C_0}{4}\sigma_k(\eta_1 - \eta_2)_X + \frac{C_0}{4}\sigma_k(\eta_1 + \eta_2)_X \\
&\le C_0\,\sigma_k(x_0)_X.
\end{aligned}$$

Here the last inequality is obtained by (4.8). This proves the sufficient condition.

We next turn to the necessary condition. Let $\Delta$ be a decoder for which the phaseless instance optimality (4.5) holds. Let $I \subseteq \{1, \ldots, m\}$. For any $\eta_1 \in \mathcal{N}(A_I)$ and $\eta_2 \in \mathcal{N}(A_{I^c})$ we have

$$|A(\eta_1 + \eta_2)| = |A(\eta_1 - \eta_2)| = |A(\eta_2 - \eta_1)|. \tag{4.9}$$

The instance optimality implies

$$\min\{\|\Delta(|A(\eta_1 + \eta_2)|) + \eta_1 + \eta_2\|_X,\ \|\Delta(|A(\eta_1 + \eta_2)|) - (\eta_1 + \eta_2)\|_X\} \le C_0\,\sigma_k(\eta_1 + \eta_2)_X. \tag{4.10}$$

Without loss of generality we may assume that

$$\|\Delta(|A(\eta_1 + \eta_2)|) + \eta_1 + \eta_2\|_X \le \|\Delta(|A(\eta_1 + \eta_2)|) - (\eta_1 + \eta_2)\|_X.$$

Then (4.10) implies that

$$\|\Delta(|A(\eta_1 + \eta_2)|) + \eta_1 + \eta_2\|_X \le C_0\,\sigma_k(\eta_1 + \eta_2)_X. \tag{4.11}$$

By (4.9), we have

$$\begin{aligned}
\|\Delta(|A(\eta_1 + \eta_2)|) + \eta_1 + \eta_2\|_X &= \|\Delta(|A(\eta_2 - \eta_1)|) - (\eta_2 - \eta_1) + 2\eta_2\|_X \\
&\ge 2\|\eta_2\|_X - \|\Delta(|A(\eta_2 - \eta_1)|) - (\eta_2 - \eta_1)\|_X.
\end{aligned} \tag{4.12}$$

Combining (4.11) and (4.12) yields

$$2\|\eta_2\|_X \le C_0\,\sigma_k(\eta_1 + \eta_2)_X + \|\Delta(|A(\eta_2 - \eta_1)|) - (\eta_2 - \eta_1)\|_X. \tag{4.13}$$

At the same time, (4.9) also implies

$$\begin{aligned}
\|\Delta(|A(\eta_1 + \eta_2)|) + \eta_1 + \eta_2\|_X &= \|\Delta(|A(\eta_2 - \eta_1)|) + (\eta_2 - \eta_1) + 2\eta_1\|_X \\
&\ge 2\|\eta_1\|_X - \|\Delta(|A(\eta_2 - \eta_1)|) + (\eta_2 - \eta_1)\|_X.
\end{aligned} \tag{4.14}$$

Putting (4.11) and (4.14) together, we obtain

$$2\|\eta_1\|_X \le C_0\,\sigma_k(\eta_1 + \eta_2)_X + \|\Delta(|A(\eta_2 - \eta_1)|) + (\eta_2 - \eta_1)\|_X. \tag{4.15}$$

It follows from (4.13) and (4.15) that

$$\begin{aligned}
\min\{\|\eta_1\|_X, \|\eta_2\|_X\} &\le \frac{C_0}{2}\sigma_k(\eta_1 + \eta_2)_X \\
&\quad + \frac{1}{2}\min\{\|\Delta(|A(\eta_2 - \eta_1)|) - (\eta_2 - \eta_1)\|_X,\ \|\Delta(|A(\eta_2 - \eta_1)|) + (\eta_2 - \eta_1)\|_X\} \\
&\le \frac{C_0}{2}\sigma_k(\eta_1 + \eta_2)_X + \frac{C_0}{2}\sigma_k(\eta_1 - \eta_2)_X.
\end{aligned}$$

Here the last inequality is obtained by the instance optimality of $(A, \Delta)$. For the case where

$$\|\Delta(|A(\eta_1 + \eta_2)|) - (\eta_1 + \eta_2)\|_X \le \|\Delta(|A(\eta_1 + \eta_2)|) + \eta_1 + \eta_2\|_X,$$

we obtain

$$\min\{\|\eta_1\|_X, \|\eta_2\|_X\} \le \frac{C_0}{2}\sigma_k(\eta_1 + \eta_2)_X + \frac{C_0}{2}\sigma_k(\eta_1 - \eta_2)_X$$

via the same argument. The theorem is now proved. □

We  next  present  a  null  space  property  for  phaseless  instance  optimality,  which 
allows  us  to  establish  parallel  results  for  sparse  phase  retrieval. 

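Definition 4.1 We say a matrix $A \in \mathbb{R}^{m \times N}$ satisfies the strong null space property (S-NSP) of order $k$ with constant $C$ if for any index set $I \subseteq \{1, \ldots, m\}$ with $|I| \ge m/2$ and $\eta \in \mathcal{N}(A_I)$ we have

$$\|\eta\|_X \le C \cdot \sigma_k(\eta)_X.$$

Theorem 4.3 Assume that a matrix $A \in \mathbb{R}^{m \times N}$ has the strong null space property of order $2k$ with constant $C_0/2$. Then there must exist a decoder $\Delta$ having the phaseless instance optimality (1.3) with constant $C_0$. In particular, one such decoder is

$$\Delta(|Ax_0|) = \operatorname*{argmin}_{|Ax| = |Ax_0|} \sigma_k(x)_X.$$

Proof Assume that $I \subseteq \{1, \ldots, m\}$. For any $\eta_1 \in \mathcal{N}(A_I)$ and $\eta_2 \in \mathcal{N}(A_{I^c})$ we must have either $\|\eta_1\|_X \le \frac{C_0}{2}\sigma_{2k}(\eta_1)_X$ or $\|\eta_2\|_X \le \frac{C_0}{2}\sigma_{2k}(\eta_2)_X$ by the strong null space property, since one of $I$ and $I^c$ has cardinality at least $m/2$. If $\|\eta_1\|_X \le \frac{C_0}{2}\sigma_{2k}(\eta_1)_X$ then

$$\|\eta_1\|_X \le \frac{C_0}{2}\sigma_{2k}(\eta_1)_X \le \frac{C_0}{4}\sigma_k(\eta_1 - \eta_2)_X + \frac{C_0}{4}\sigma_k(\eta_1 + \eta_2)_X.$$

Similarly if $\|\eta_2\|_X \le \frac{C_0}{2}\sigma_{2k}(\eta_2)_X$ we will have

$$\|\eta_2\|_X \le \frac{C_0}{2}\sigma_{2k}(\eta_2)_X \le \frac{C_0}{4}\sigma_k(\eta_1 - \eta_2)_X + \frac{C_0}{4}\sigma_k(\eta_1 + \eta_2)_X.$$

It follows that

$$\min\{\|\eta_1\|_X, \|\eta_2\|_X\} \le \frac{C_0}{4}\sigma_k(\eta_1 - \eta_2)_X + \frac{C_0}{4}\sigma_k(\eta_1 + \eta_2)_X. \tag{4.16}$$

Theorem 4.2 now implies that the required decoder $\Delta$ exists. Furthermore, by the proof of the sufficiency part of Theorem 4.2,

$$\Delta(|Ax_0|) = \operatorname*{argmin}_{|Ax| = |Ax_0|} \sigma_k(x)_X$$

is one such decoder. □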


4.3 The Case $X = \ell_1$

We will now apply Theorem 4.3 to the $\ell_1$-norm case. The following lemma establishes a relation between S-RIP and S-NSP for the $\ell_1$-norm.

Lemma 4.1 Let $a, b, k$ be integers. Assume that $A \in \mathbb{R}^{m \times N}$ satisfies the S-RIP of order $(a + b)k$ with constants $\theta_-, \theta_+ \in (0, 2)$. Then $A$ satisfies the S-NSP of order $ak$ under the $\ell_1$-norm with constant

$$C_0 = 1 + \sqrt{\frac{a(1 + \delta)}{b(1 - \delta)}},$$

where $\delta$ is the restricted isometry constant and $\delta := \max\{1 - \theta_-,\ \theta_+ - 1\} < 1$.

We remark that the above lemma is analogous to the following lemma providing a relationship between RIP and NSP, which was shown in [5]:

Lemma 4.2 ([5, Lemma 4.1]) Let $a = l/k$, $b = l'/k$ where $l, l' \ge k$ are integers. Assume that $A \in \mathbb{R}^{m \times N}$ satisfies the RIP of order $(a + b)k$ with $\delta = \delta_{(a+b)k} < 1$. Then $A$ satisfies the null space property under the $\ell_1$-norm of order $ak$ with constant

$$C_0 = 1 + \sqrt{\frac{a(1 + \delta)}{b(1 - \delta)}}.$$

Proof of Lemma 4.1 By the definition of S-RIP, for any index set $I \subseteq \{1, \ldots, m\}$ with $|I| \ge m/2$, the matrix $A_I \in \mathbb{R}^{|I| \times N}$ satisfies the RIP of order $(a + b)k$ with constant $\delta_{(a+b)k} = \delta := \max\{1 - \theta_-,\ \theta_+ - 1\} < 1$. It follows from Lemma 4.2 that

$$\|\eta\|_1 \le \left(1 + \sqrt{\frac{a(1 + \delta)}{b(1 - \delta)}}\right)\sigma_{ak}(\eta)_1$$

for all $\eta \in \mathcal{N}(A_I)$. This proves the lemma. □

Setting $a = 2$ and $b = 1$ in Lemma 4.1, we infer that if $A$ satisfies the S-RIP of order $3k$ with constants $\theta_-, \theta_+ \in (0, 2)$, then $A$ satisfies the S-NSP of order $2k$ under the $\ell_1$-norm with constant $C_0 = 1 + \sqrt{\frac{2(1 + \delta)}{1 - \delta}}$. Hence by Theorem 4.3, there must exist a decoder that has the phaseless instance optimality under the $\ell_1$-norm with constant $2C_0$. According to Theorem 2.1, by taking $m = O(k \log(N/k))$ a Gaussian random matrix $A$ satisfies S-RIP of order $3k$ with high probability. Hence, there exists a decoder $\Delta$ so that the pair $(A, \Delta)$ has the $\ell_1$-norm phaseless instance optimality at the cost of $m = O(k \log(N/k))$ measurements, as with the traditional instance optimality.

We are now ready to prove the following theorem on phaseless instance optimality under the $\ell_1$-norm.

Theorem 4.4 Let $A \in \mathbb{R}^{m \times N}$ satisfy the S-RIP of order $tk$ with constants $0 < \theta_- < 1 < \theta_+ < 2$, where

$$t > \max\left\{\frac{2}{\theta_-},\ \frac{2}{2 - \theta_+}\right\} > 2.$$

Let

$$\Delta(|Ax_0|) = \operatorname*{argmin}_{x \in \mathbb{R}^N} \{\|x\|_1 : |Ax| = |Ax_0|\}. \tag{4.17}$$

Then $(A, \Delta)$ has the $\ell_1$-norm phaseless instance optimality with constant $C = \frac{2C_0}{2 - C_0}$, where $C_0 = 1 + \sqrt{\frac{1 + \delta}{(t - 1)(1 - \delta)}}$ and as before

$$\delta := \max\{1 - \theta_-,\ \theta_+ - 1\} < 1 - \frac{2}{t}.$$

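Proof Let $x_0 \in \mathbb{R}^N$ and set $\hat{x} = \Delta(|Ax_0|)$. Then by definition

$$\|\hat{x}\|_1 \le \|x_0\|_1 \quad \text{and} \quad |A\hat{x}| = |Ax_0|.$$

Denote by $I \subseteq \{1, \ldots, m\}$ the set of indices

$$I = \{j : \langle a_j, \hat{x}\rangle = \langle a_j, x_0\rangle\},$$

and thus $\langle a_j, \hat{x}\rangle = -\langle a_j, x_0\rangle$ for $j \in I^c$. It follows that

$$A_I(\hat{x} - x_0) = 0 \quad \text{and} \quad A_{I^c}(\hat{x} + x_0) = 0.$$

Set

$$\eta := \hat{x} - x_0 \in \mathcal{N}(A_I).$$

We know that $A$ satisfies the S-RIP of order $tk$ with constants $\theta_-, \theta_+$ where

$$t > \max\left\{\frac{2}{\theta_-},\ \frac{2}{2 - \theta_+}\right\} > 2.$$

For the case $|I| \ge m/2$, $A_I$ satisfies the RIP of order $tk$ with RIP constant

$$\delta = \delta_{tk} := \max\{1 - \theta_-,\ \theta_+ - 1\} < 1 - \frac{2}{t} < 1.$$

Take $a := 1$, $b := t - 1$ in Lemma 4.1. Then $A$ satisfies the $\ell_1$-norm S-NSP of order $k$ with constant

$$C_0 = 1 + \sqrt{\frac{1 + \delta}{(t - 1)(1 - \delta)}} < 2.$$

This yields

$$\|\eta\|_1 \le C_0 \|\eta_{T^c}\|_1, \tag{4.18}$$

where $T$ is the index set for the $k$ largest coefficients of $x_0$ in magnitude. Hence $\|\eta_T\|_1 \le (C_0 - 1)\|\eta_{T^c}\|_1$. Since $\|\hat{x}\|_1 \le \|x_0\|_1$ we have

$$\begin{aligned}
\|x_0\|_1 \ge \|\hat{x}\|_1 = \|x_0 + \eta\|_1 &= \|x_{0,T} + x_{0,T^c} + \eta_T + \eta_{T^c}\|_1 \\
&\ge \|x_{0,T}\|_1 - \|x_{0,T^c}\|_1 + \|\eta_{T^c}\|_1 - \|\eta_T\|_1.
\end{aligned}$$

It follows that

$$\|\eta_{T^c}\|_1 \le \|\eta_T\|_1 + 2\sigma_k(x_0)_1 \le (C_0 - 1)\|\eta_{T^c}\|_1 + 2\sigma_k(x_0)_1$$

and thus

$$\|\eta_{T^c}\|_1 \le \frac{2}{2 - C_0}\sigma_k(x_0)_1.$$

Now (4.18) yields

$$\|\eta\|_1 \le C_0\|\eta_{T^c}\|_1 \le \frac{2C_0}{2 - C_0}\sigma_k(x_0)_1,$$

which implies

$$\|\hat{x} - x_0\|_1 \le C_0\|\eta_{T^c}\|_1 \le \frac{2C_0}{2 - C_0}\sigma_k(x_0)_1.$$

For the case $|I^c| \ge m/2$ an identical argument yields

$$\|\hat{x} + x_0\|_1 \le C_0\|(\hat{x} + x_0)_{T^c}\|_1 \le \frac{2C_0}{2 - C_0}\sigma_k(x_0)_1.$$

The theorem is now proved. □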
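
As a rough numerical illustration of Theorem 4.4, the sketch below evaluates $\delta = \max\{1 - \theta_-, \theta_+ - 1\}$, the constant $C_0 = 1 + \sqrt{(1+\delta)/((t-1)(1-\delta))}$ obtained from Lemma 4.1 with $a = 1$, $b = t - 1$, and the resulting instance-optimality constant $C = 2C_0/(2 - C_0)$. The function name and the sample bounds are ours, and the formula for $C_0$ is our reading of the statement, so the output should be taken as indicative only.

```python
import math


def l1_phaseless_constants(theta_minus, theta_plus, t):
    """Return (delta, C0, C) for S-RIP bounds theta_minus < 1 < theta_plus and order t*k."""
    delta = max(1 - theta_minus, theta_plus - 1)
    # The argument needs delta < 1 - 2/t, which forces C0 < 2 and keeps C finite.
    if t <= 2 or not (0 <= delta < 1 - 2.0 / t):
        raise ValueError("need t > 2 and 0 <= delta < 1 - 2/t")
    c0 = 1 + math.sqrt((1 + delta) / ((t - 1) * (1 - delta)))
    return delta, c0, 2 * c0 / (2 - c0)


delta, c0, c = l1_phaseless_constants(0.9, 1.1, t=4.0)
print(delta, c0, c)  # C0 lies strictly between 1 and 2, so C is finite
```

The constant $C$ grows without bound as $\delta$ approaches $1 - 2/t$, so larger $t$ (more measurements) buys a smaller constant for fixed S-RIP bounds.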

By Theorem 2.1, an $m \times N$ random Gaussian matrix with $m = O(tk \log(N/k))$ satisfies the S-RIP of order $tk$ with high probability. We therefore conclude that the $\ell_1$-norm phaseless instance optimality of order $k$ can be achieved at the cost of $m = O(tk \log(N/k))$ measurements.

4.4 Mixed-Norm Phaseless Instance Optimality

We now consider mixed-norm phaseless instance optimality. Let $1 \le q \le p \le 2$ and $s = 1/q - 1/p$. We seek estimates of the form

$$\min\{\|x - \Delta(|Ax|)\|_p,\ \|x + \Delta(|Ax|)\|_p\} \le C_0\, k^{-s}\,\sigma_k(x)_q \tag{4.19}$$

for all $x \in \mathbb{R}^N$. We shall prove both necessary and sufficient conditions for mixed-norm phaseless instance optimality.

Theorem 4.5 Let $A \in \mathbb{R}^{m \times N}$ and $1 \le q \le p \le 2$. Set $s = 1/q - 1/p$. Then a decoder $\Delta$ satisfying the mixed-norm phaseless instance optimality (4.19) with constant $C_0$ exists if: for any index set $I \subseteq \{1, \ldots, m\}$ and any $\eta_1 \in \mathcal{N}(A_I)$, $\eta_2 \in \mathcal{N}(A_{I^c})$ we have

$$\min\{\|\eta_1\|_p, \|\eta_2\|_p\} \le \frac{C_0}{4} k^{-s}\big(\sigma_k(\eta_1 - \eta_2)_q + \sigma_k(\eta_1 + \eta_2)_q\big). \tag{4.20}$$

Conversely, assume a decoder $\Delta$ satisfying the mixed-norm phaseless instance optimality (4.19) exists. Then for any index set $I \subseteq \{1, \ldots, m\}$ and any $\eta_1 \in \mathcal{N}(A_I)$, $\eta_2 \in \mathcal{N}(A_{I^c})$ we have

$$\min\{\|\eta_1\|_p, \|\eta_2\|_p\} \le \frac{C_0}{2} k^{-s}\big(\sigma_k(\eta_1 - \eta_2)_q + \sigma_k(\eta_1 + \eta_2)_q\big).$$

Proof The proof is virtually identical to the proof of Theorem 4.2. We omit the details here in the interest of brevity. □

Definition 4.2 (Mixed-Norm Strong Null Space Property) We say that $A$ has the mixed strong null space property in norms $(\ell_p, \ell_q)$ of order $k$ with constant $C$ if for any index set $I \subseteq \{1, \ldots, m\}$ with $|I| \ge m/2$ the matrix $A_I \in \mathbb{R}^{|I| \times N}$ satisfies

$$\|\eta\|_p \le C k^{-s} \sigma_k(\eta)_q$$

for all $\eta \in \mathcal{N}(A_I)$, where $s = 1/q - 1/p$.

The above is an extension of the standard definition of the mixed null space property of order $k$ in norms $(\ell_p, \ell_q)$ (see [5]) for a matrix $A$, which requires

$$\|\eta\|_p \le C k^{-s} \sigma_k(\eta)_q$$

for all $\eta \in \mathcal{N}(A)$. We have the following straightforward generalization of Theorem 4.3.

Theorem 4.6 Assume that $A \in \mathbb{R}^{m \times N}$ has the mixed strong null space property of order $2k$ in norms $(\ell_p, \ell_q)$ with constant $C_0/2$, where $1 \le q \le p \le 2$. Then there exists a decoder $\Delta$ such that the mixed-norm phaseless instance optimality (4.19) holds with constant $C_0$.

We establish relationships between the mixed-norm strong null space property and the S-RIP. First we present the following lemma that was proved in [5].

Lemma 4.3 ([5, Lemma 8.2]) Let $k \ge 1$ and $\tilde{k} = \lceil k(N/k)^{2 - 2/q} \rceil$. Assume that $A \in \mathbb{R}^{m \times N}$ satisfies the RIP of order $2k + \tilde{k}$ with $\delta := \delta_{2k + \tilde{k}} < 1$. Then $A$ satisfies the mixed null space property in norms $(\ell_p, \ell_q)$ of order $2k$ with constant

$$C_0 = 2^{1/p + 1/2}\sqrt{\frac{1 + \delta}{1 - \delta}} + 2^{1/p - 1/q}.$$

Proposition 4.1 Let $k \ge 1$ and $\tilde{k} = \lceil k(N/k)^{2 - 2/q} \rceil$. Assume that $A \in \mathbb{R}^{m \times N}$ satisfies the S-RIP of order $2k + \tilde{k}$ with constants $0 < \theta_- < 1 < \theta_+ < 2$. Then $A$ satisfies the mixed strong null space property in norms $(\ell_p, \ell_q)$ of order $2k$ with constant $C_0 = 2^{1/p + 1/2}\sqrt{\frac{1 + \delta}{1 - \delta}} + 2^{1/p - 1/q}$, where $\delta$ is the RIP constant and $\delta := \delta_{2k + \tilde{k}} = \max\{1 - \theta_-,\ \theta_+ - 1\}$.

Proof By definition, for any index set $I \subseteq \{1, \ldots, m\}$ with $|I| \ge m/2$, the matrix $A_I \in \mathbb{R}^{|I| \times N}$ satisfies the RIP of order $2k + \tilde{k}$ with constant $\delta := \delta_{2k + \tilde{k}} = \max\{1 - \theta_-,\ \theta_+ - 1\}$. By Lemma 4.3, we know that $A_I$ satisfies the mixed null space property in norms $(\ell_p, \ell_q)$ of order $2k$ with constant $C_0 = 2^{1/p + 1/2}\sqrt{\frac{1 + \delta}{1 - \delta}} + 2^{1/p - 1/q}$, in other words for any $\eta \in \mathcal{N}(A_I)$,

$$\|\eta\|_p \le C_0 k^{-s} \sigma_{2k}(\eta)_q.$$

So $A$ satisfies the mixed strong null space property. □

Corollary 4.1 Let $k \ge 1$ and $\tilde{k} = \lceil k(N/k)^{2 - 2/q} \rceil$. Assume that $A \in \mathbb{R}^{m \times N}$ satisfies the S-RIP of order $2k + \tilde{k}$ with constants $0 < \theta_- < 1 < \theta_+ < 2$. Let $\delta := \delta_{2k + \tilde{k}} = \max\{1 - \theta_-,\ \theta_+ - 1\} < 1$. Define the decoder $\Delta$ for $A$ by

$$\Delta(|Ax_0|) = \operatorname*{argmin}_{|Ax| = |Ax_0|} \sigma_k(x)_q. \tag{4.21}$$

Then (4.19) holds with constant $2C_0$, where $C_0 = 2^{1/p + 1/2}\sqrt{\frac{1 + \delta}{1 - \delta}} + 2^{1/p - 1/q}$.

Proof By Proposition 4.1, the matrix $A$ satisfies the mixed strong null space property in $(\ell_p, \ell_q)$ of order $2k$ with constant $C_0 = 2^{1/p + 1/2}\sqrt{\frac{1 + \delta}{1 - \delta}} + 2^{1/p - 1/q}$. The corollary now follows immediately from Theorem 4.6. □

Remark 4.1 Combining Theorem 2.1 and Corollary 4.1, the mixed phaseless instance optimality of order $k$ in norms $(\ell_p, \ell_q)$ can be achieved for the price of $O(k(N/k)^{2 - 2/q} \log(N/k))$ measurements, just as with the traditional mixed $(\ell_p, \ell_q)$-norm instance optimality. Theorem 3.1 implies that the $\ell_1$ decoder satisfies the $(p, q) = (2, 1)$ mixed-norm phaseless instance optimality at the price of $O(k \log(N/k))$ measurements.

Appendix: Proof of Lemma 2.1

We will first need the following two lemmas to prove Lemma 2.1.

Lemma 5.1 (Sparse Representation of a Polytope [2, 12]) Let $s \ge 1$ and $\alpha > 0$. Set

$$T(\alpha, s) := \{v \in \mathbb{R}^n : \|v\|_\infty \le \alpha,\ \|v\|_1 \le s\alpha\}.$$

For any $v \in \mathbb{R}^n$ let

$$U(\alpha, s, v) := \{u \in \mathbb{R}^n : \operatorname{supp}(u) \subseteq \operatorname{supp}(v),\ \|u\|_0 \le s,\ \|u\|_1 = \|v\|_1,\ \|u\|_\infty \le \alpha\}.$$

Then $v \in T(\alpha, s)$ if and only if $v$ is in the convex hull of $U(\alpha, s, v)$, i.e. $v$ can be expressed as a convex combination of some $u_1, \ldots, u_N$ in $U(\alpha, s, v)$.

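Lemma 5.2 ([1, Lemma 5.3]) Assume that $a_1 \ge a_2 \ge \cdots \ge a_m \ge 0$. Let $r \le m$ and $\lambda \ge 0$ be such that $\sum_{i=1}^{r} a_i + \lambda \ge \sum_{i=r+1}^{m} a_i$. Then for all $\alpha \ge 1$ we have

$$\sum_{j=r+1}^{m} a_j^{\alpha} \le r\left(\left(\frac{\sum_{i=1}^{r} a_i^{\alpha}}{r}\right)^{1/\alpha} + \frac{\lambda}{r}\right)^{\alpha}.$$

In particular for $\lambda = 0$ we have

$$\sum_{j=r+1}^{m} a_j^{\alpha} \le \sum_{i=1}^{r} a_i^{\alpha}. \tag{5.1}$$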

We  are  now  ready  to  prove  Lemma  2.1. 

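Proof Set $h = \hat{x} - x_0$. Let $T_0$ denote the set of the largest $k$ coefficients of $x_0$ in magnitude. Then

$$\begin{aligned}
\|x_0\|_1 + \rho \ge \|\hat{x}\|_1 = \|x_0 + h\|_1 &= \|x_{0,T_0} + h_{T_0} + x_{0,T_0^c} + h_{T_0^c}\|_1 \\
&\ge \|x_{0,T_0}\|_1 - \|h_{T_0}\|_1 - \|x_{0,T_0^c}\|_1 + \|h_{T_0^c}\|_1.
\end{aligned}$$

It follows that

$$\begin{aligned}
\|h_{T_0^c}\|_1 &\le \|h_{T_0}\|_1 + 2\|x_{0,T_0^c}\|_1 + \rho \\
&= \|h_{T_0}\|_1 + 2\sigma_k(x_0)_1 + \rho.
\end{aligned}$$

Suppose that $S_0$ is the index set of the $k$ largest entries in absolute value of $h$. Then we can get

$$\begin{aligned}
\|h_{S_0^c}\|_1 \le \|h_{T_0^c}\|_1 &\le \|h_{T_0}\|_1 + 2\sigma_k(x_0)_1 + \rho \\
&\le \|h_{S_0}\|_1 + 2\sigma_k(x_0)_1 + \rho.
\end{aligned}$$

Set

$$\alpha := \frac{\|h_{S_0}\|_1 + 2\sigma_k(x_0)_1 + \rho}{k}.$$

We divide $h_{S_0^c}$ into two parts $h_{S_0^c} = h^{(1)} + h^{(2)}$, where

$$h^{(1)} := h_{S_0^c} \cdot I_{\{i : |h_{S_0^c}(i)| > \alpha/(t-1)\}}, \qquad h^{(2)} := h_{S_0^c} \cdot I_{\{i : |h_{S_0^c}(i)| \le \alpha/(t-1)\}}.$$

A simple observation is that $\|h^{(1)}\|_1 \le \|h_{S_0^c}\|_1 \le \alpha k$. Set

$$\ell := |\operatorname{supp}(h^{(1)})| = \|h^{(1)}\|_0.$$

Since all non-zero entries of $h^{(1)}$ have magnitude larger than $\alpha/(t - 1)$, we have

$$\alpha k \ge \|h^{(1)}\|_1 = \sum_{i \in \operatorname{supp}(h^{(1)})} |h^{(1)}(i)| \ge \sum_{i \in \operatorname{supp}(h^{(1)})} \frac{\alpha}{t - 1} = \frac{\alpha \ell}{t - 1},$$

which implies $\ell \le (t - 1)k$. Thus we have:

$$\langle A(h_{S_0} + h^{(1)}), Ah\rangle \le \|A(h_{S_0} + h^{(1)})\|_2 \cdot \|Ah\|_2 \le \sqrt{1 + \delta} \cdot \|h_{S_0} + h^{(1)}\|_2 \cdot \epsilon. \tag{5.2}$$

Here we apply the facts that $\|h_{S_0} + h^{(1)}\|_0 = \ell + k \le tk$ and $A$ satisfies the RIP of order $tk$ with $\delta := \delta_{tk}^{A}$. We shall assume at first that $tk$ is an integer. Note that $\|h^{(2)}\|_\infty \le \frac{\alpha}{t - 1}$ and

$$\|h^{(2)}\|_1 = \|h_{S_0^c}\|_1 - \|h^{(1)}\|_1 \le k\alpha - \frac{\alpha \ell}{t - 1} = (k(t - 1) - \ell)\frac{\alpha}{t - 1}. \tag{5.3}$$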

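We take $s := k(t - 1) - \ell$ in Lemma 5.1 and obtain that $h^{(2)}$ is a weighted mean

$$h^{(2)} = \sum_{i=1}^{N} \lambda_i u_i, \qquad 0 \le \lambda_i \le 1, \qquad \sum_{i=1}^{N} \lambda_i = 1,$$

where $\|u_i\|_0 \le k(t - 1) - \ell$, $\|u_i\|_1 = \|h^{(2)}\|_1$, $\|u_i\|_\infty \le \alpha/(t - 1)$ and $\operatorname{supp}(u_i) \subseteq \operatorname{supp}(h^{(2)})$. Hence

$$\begin{aligned}
\|u_i\|_2 \le \sqrt{\|u_i\|_0} \cdot \|u_i\|_\infty &\le \sqrt{k(t - 1) - \ell} \cdot \|u_i\|_\infty \\
&\le \sqrt{k(t - 1)} \cdot \|u_i\|_\infty \\
&\le \alpha\sqrt{k/(t - 1)}.
\end{aligned}$$

Now for $0 \le \mu \le 1$ and $d \ge 0$, which will be chosen later, set

$$\beta_j := h_{S_0} + h^{(1)} + \mu \cdot u_j, \quad j = 1, \ldots, N.$$

Then for fixed $i \in [1, N]$

$$\begin{aligned}
\sum_{j=1}^{N} \lambda_j \beta_j - d\beta_i &= h_{S_0} + h^{(1)} + \mu \cdot h^{(2)} - d\beta_i \\
&= (1 - \mu - d)(h_{S_0} + h^{(1)}) - d\mu u_i + \mu h.
\end{aligned}$$

Recall that $\alpha = \frac{\|h_{S_0}\|_1 + 2\sigma_k(x_0)_1 + \rho}{k}$. Thus

$$\begin{aligned}
\|u_i\|_2 &\le \sqrt{k/(t - 1)}\,\alpha \\
&\le \frac{\|h_{S_0}\|_2}{\sqrt{t - 1}} + \frac{2\sigma_k(x_0)_1 + \rho}{\sqrt{k(t - 1)}} \\
&\le \frac{\|h_{S_0} + h^{(1)}\|_2}{\sqrt{t - 1}} + \frac{2\sigma_k(x_0)_1 + \rho}{\sqrt{k(t - 1)}} \\
&= \frac{z + R}{\sqrt{t - 1}},
\end{aligned} \tag{5.4}$$

where $z := \|h_{S_0} + h^{(1)}\|_2$ and $R := \frac{2\sigma_k(x_0)_1 + \rho}{\sqrt{k}}$. It is easy to check the following identity: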

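$$(2d - 1)\sum_{1 \le i < j \le N} \lambda_i \lambda_j \|A(\beta_i - \beta_j)\|_2^2 = \sum_{i=1}^{N} \lambda_i \Big\|A\Big(\sum_{j=1}^{N} \lambda_j \beta_j - d\beta_i\Big)\Big\|_2^2 - \sum_{i=1}^{N} \lambda_i (1 - d)^2 \|A\beta_i\|_2^2 \tag{5.5}$$

provided that $\sum_{i=1}^{N} \lambda_i = 1$. Choosing $d = 1/2$ in (5.5) we then have

$$\sum_{i=1}^{N} \lambda_i \Big\|A\Big(\big(\tfrac{1}{2} - \mu\big)(h_{S_0} + h^{(1)}) - \tfrac{\mu}{2} u_i + \mu h\Big)\Big\|_2^2 - \sum_{i=1}^{N} \frac{\lambda_i}{4}\|A\beta_i\|_2^2 = 0.$$

Note that for $d = 1/2$,

$$\begin{aligned}
\Big\|A\Big(\big(\tfrac{1}{2} - \mu\big)(h_{S_0} + h^{(1)}) - \tfrac{\mu}{2} u_i + \mu h\Big)\Big\|_2^2
&= \Big\|A\Big(\big(\tfrac{1}{2} - \mu\big)(h_{S_0} + h^{(1)}) - \tfrac{\mu}{2} u_i\Big)\Big\|_2^2 \\
&\quad + 2\Big\langle A\Big(\big(\tfrac{1}{2} - \mu\big)(h_{S_0} + h^{(1)}) - \tfrac{\mu}{2} u_i\Big),\ \mu Ah\Big\rangle + \mu^2\|Ah\|_2^2.
\end{aligned}$$

It follows from $\sum_{i=1}^{N} \lambda_i = 1$ and $h^{(2)} = \sum_{i=1}^{N} \lambda_i u_i$ that

$$\begin{aligned}
0 &= \sum_{i=1}^{N} \lambda_i \Big\|A\Big(\big(\tfrac{1}{2} - \mu\big)(h_{S_0} + h^{(1)}) - \tfrac{\mu}{2} u_i + \mu h\Big)\Big\|_2^2 - \sum_{i=1}^{N} \frac{\lambda_i}{4}\|A\beta_i\|_2^2 \\
&= \sum_{i=1}^{N} \lambda_i \Big\|A\Big(\big(\tfrac{1}{2} - \mu\big)(h_{S_0} + h^{(1)}) - \tfrac{\mu}{2} u_i\Big)\Big\|_2^2
+ 2\Big\langle A\Big(\big(\tfrac{1}{2} - \mu\big)(h_{S_0} + h^{(1)}) - \tfrac{\mu}{2} h^{(2)}\Big),\ \mu Ah\Big\rangle \\
&\quad + \mu^2\|Ah\|_2^2 - \sum_{i=1}^{N} \frac{\lambda_i}{4}\|A\beta_i\|_2^2 \\
&= \sum_{i=1}^{N} \lambda_i \Big\|A\Big(\big(\tfrac{1}{2} - \mu\big)(h_{S_0} + h^{(1)}) - \tfrac{\mu}{2} u_i\Big)\Big\|_2^2
+ \mu(1 - \mu)\big\langle A(h_{S_0} + h^{(1)}),\ Ah\big\rangle - \sum_{i=1}^{N} \frac{\lambda_i}{4}\|A\beta_i\|_2^2.
\end{aligned} \tag{5.6}$$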
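
The identity (5.5) depends only on the vectors $v_i = A\beta_i$ and on the weights summing to one, so it can be sanity-checked numerically with arbitrary vectors. The sketch below does this for a few values of $d$. The function name `identity_55_sides` and the random test data are ours.

```python
import random


def sq_norm(w):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(c * c for c in w)


def identity_55_sides(vs, lam, d):
    """Both sides of (5.5) for vectors vs[i] = A beta_i and convex weights lam."""
    dim = len(vs[0])
    # Weighted mean sum_j lam_j * v_j, i.e. A applied to sum_j lam_j * beta_j.
    mean = [sum(w * v[c] for w, v in zip(lam, vs)) for c in range(dim)]
    pairs = sum(
        lam[i] * lam[j] * sq_norm([a - b for a, b in zip(vs[i], vs[j])])
        for i in range(len(vs)) for j in range(i + 1, len(vs))
    )
    lhs = (2 * d - 1) * pairs
    rhs = sum(w * sq_norm([mc - d * vc for mc, vc in zip(mean, v)])
              for w, v in zip(lam, vs))
    rhs -= sum(w * (1 - d) ** 2 * sq_norm(v) for w, v in zip(lam, vs))
    return lhs, rhs


rng = random.Random(1)
vs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
raw = [rng.random() for _ in range(5)]
lam = [c / sum(raw) for c in raw]                 # weights sum to 1
for d in (0.5, 0.3, 2.0):
    lhs, rhs = identity_55_sides(vs, lam, d)
    print(abs(lhs - rhs) < 1e-9)                  # True for each d
```

For $d = 1/2$ the left-hand side vanishes, which is exactly the case used in the proof.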

Set  /x  =  \/t(t  —  1)  —  (t  —  1).  We  next  estimate  the  three  terms  in  (5.6).  Noting  that 
\\hs0\\o  <  ft  ||/7(1)||o  <  f  and  ||m;  ||0  <  s  =  k{t  -  -  i,  we  obtain 


llftllo<  IIHHo+P(1)llo+ll«illo<r-fc 

and  ||  (j  —  fi)(hs0  +  ftA)  —  ^Uj  ||o  <  t  ■  k.  Since  A  satisfies  the  RIP  of  order  t  ■  k  with 
S,  we  have 

\A({X--ix){hs0  +  h^)-^ui)f2 

<  (l+Sml-ki)(hs0+hm)-^Ui\\l 

=  (1  +  S)((i  -  yu)2ll(/^0  +  ^(1))lli  +  ^IlMilll) 

=  (l+5)((^-/^)V  +  ^ll«ill|) 

and 


II Aft  ||2  >  (1  -  <5) || ft  || 2  =  (1  -  mhso  +  hA)\\22  +  !i2  •  \\ui\\l) 
=  (1  -  5)(z2  +  M2  -IlMflll). 
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Combining  the  result  above  with  (5.2)  and  (5.4)  we  get 


N  1  2 

0  <  (1  +  <$)  Z  Xi  ((-  -  nfz2  +  ^  ||  Ui  ||2)  +  nil  -  /x)vTTI  •  z  •  e 

i= 1 


N 


-a-*)Z>2  +  «) 


i= 1 


=  Z  ^  ((d  +  5)4  _  M)2  “  l~ir)z2  +  H2)  +  -  /^)vTTI  •  z  •  e 


(=1 

N 


//  1  9  1  — <5\  7  <5  9(z  +  /?)"\ 

<Iii(((i+i)(r«)  -  T)i  +2"'  ,-1  ) 

1  =  1 

+  /X(l  -  /Ji)V  1  +  5  •  Z  ■  € 

=  ((/t2  -  n)  +  i(l  -  n  +  (i  +  ^t^)m2))z2 

/  , -  Sfi2R\  S/i2R 2 

+  (pd-riVi+I.<  +  — )i  +  — 

t  -  1 


i((2r-l)-V'r('-l))( 


/  9  It  , -  8ix2R\  du~t 


<$)zz 

8/jL2R2 


2  (f  -  1) 


-  Vz2  +  (Vt(t-mi+8)e  +  8R)Z  +  ^-) 


<5fi2' 
2" 


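which is a quadratic inequality for $z$. We know $\delta < \sqrt{(t-1)/t}$. So by solving the above inequality we get

$$z \le \frac{\big(\sqrt{t(t-1)(1+\delta)}\,\epsilon + \delta R\big) + \Big(\big(\sqrt{t(t-1)(1+\delta)}\,\epsilon + \delta R\big)^2 + 2t\big(\sqrt{(t-1)/t} - \delta\big)\delta R^2\Big)^{1/2}}{2t\big(\sqrt{(t-1)/t} - \delta\big)}$$

$$\le \frac{\sqrt{t(t-1)(1+\delta)}}{t\big(\sqrt{(t-1)/t} - \delta\big)}\,\epsilon + \frac{2\delta + \sqrt{2t\big(\sqrt{(t-1)/t} - \delta\big)\delta}}{2t\big(\sqrt{(t-1)/t} - \delta\big)}\,R.$$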

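Finally, noting that $\|h_{S_0^c}\|_1 \le \|h_{S_0}\|_1 + R\sqrt{k}$, in Lemma 5.2, if we set $m = N$, $r = k$, $\lambda = R\sqrt{k} \ge 0$ and $\alpha = 2$ then $\|h_{S_0^c}\|_2 \le \|h_{S_0}\|_2 + R$. Hence

$$\|h\|_2 = \sqrt{\|h_{S_0}\|_2^2 + \|h_{S_0^c}\|_2^2} \le \sqrt{\|h_{S_0}\|_2^2 + (\|h_{S_0}\|_2 + R)^2} \le \sqrt{2\|h_{S_0}\|_2^2} + R$$

$$\le \frac{\sqrt{2(1+\delta)}}{1 - \sqrt{t/(t-1)}\,\delta}\,\epsilon + \left(\frac{\sqrt{2}\,\delta + \sqrt{t\big(\sqrt{(t-1)/t} - \delta\big)\delta}}{t\big(\sqrt{(t-1)/t} - \delta\big)} + 1\right) R.$$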

Substitute  i?  into  this  inequality  and  the  conclusion  follows. 

For the case where $t \cdot k$ is not an integer, we set $t^* := \lceil tk \rceil / k$, then $t^* > t$ and $\delta_{t^*k} = \delta_{tk} < \sqrt{(t-1)/t} < \sqrt{(t^*-1)/t^*}$. We can then prove the result by working on $\delta_{t^*k}$. □
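The rounding step for non-integer $t \cdot k$ can be illustrated numerically. The following is a minimal sketch, not part of the original argument: the function names `round_up_order` and `rip_threshold` are hypothetical, and the values $t = 4/3$, $k = 2$ are made up for illustration. It only checks that $t^* k$ is an integer, that $t^* \ge t$, and that the threshold $\sqrt{(t-1)/t}$ increases when $t$ is replaced by $t^*$.

```python
import math
from fractions import Fraction

def round_up_order(t, k):
    """Return t* = ceil(t*k)/k as an exact fraction (t a Fraction > 1, k a positive int)."""
    return Fraction(math.ceil(t * k), k)

def rip_threshold(t):
    """The quantity sqrt((t-1)/t) against which the RIP constant is compared."""
    return math.sqrt((t - 1) / t)

# Illustrative values only (assumed, not taken from the text).
t, k = Fraction(4, 3), 2
t_star = round_up_order(t, k)
print(t_star, rip_threshold(t), rip_threshold(t_star))
```

Since $x \mapsto \sqrt{(x-1)/x}$ is increasing for $x > 1$, any bound that holds below the threshold for $t$ also holds below the threshold for $t^*$, which is what allows the proof to work with $\delta_{t^*k}$.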
We consider the problem of determining the countable sets $\Lambda$ in the time-frequency plane such that the Gabor system generated by the time-frequency shifts of the window $\chi_{[0,1]^d}$ associated with $\Lambda$ forms a Gabor orthonormal basis for $L^2(\mathbb{R}^d)$. We show that, if this is the case, the translates by elements of $\Lambda$ of the unit cube in $\mathbb{R}^{2d}$ must tile the time-frequency space $\mathbb{R}^{2d}$. By studying the possible structure of such tiling sets, we completely classify all such admissible sets $\Lambda$ of time-frequency shifts when $d = 1, 2$. Moreover, an inductive procedure for constructing such sets $\Lambda$ in dimension $d \ge 3$ is also given. An interesting and surprising consequence of our results is the existence, for $d \ge 2$, of discrete sets $\Lambda$ with $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$ forming a Gabor orthonormal basis but with the associated "time"-translates of the window $\chi_{[0,1]^d}$ having significant overlaps.
1. Introduction

Let $g$ be a non-zero function in $L^2(\mathbb{R}^d)$ and let $\Lambda$ be a discrete countable set in $\mathbb{R}^{2d}$, where we identify $\mathbb{R}^{2d}$ with the time-frequency plane by writing $(t, \lambda) \in \Lambda$ with $t, \lambda \in \mathbb{R}^d$. The Gabor system associated with the window $g$ consists of the set of translates and modulates of $g$:

$$\mathcal{G}(g, \Lambda) = \{ e^{2\pi i \langle \lambda, x \rangle} g(x - t) : (t, \lambda) \in \Lambda \}. \tag{1.1}$$

Such systems were first introduced by Gabor [5], who used them for applications in the theory of telecommunication, but there has been more recent interest in using Gabor systems to expand functions, from both a theoretical and an applied perspective. The branch of Fourier analysis dealing with Gabor systems is usually referred to as Gabor, or time-frequency, analysis. Gröchenig's monograph [6] provides an excellent and detailed exposition of this subject.

Recall that the Gabor system is a frame for $L^2(\mathbb{R}^d)$ if there exist constants $A, B > 0$ such that

$$A\|f\|^2 \le \sum_{(t,\lambda) \in \Lambda} \left| \langle f, e^{2\pi i \langle \lambda, \cdot \rangle} g(\cdot - t) \rangle \right|^2 \le B\|f\|^2, \quad f \in L^2(\mathbb{R}^d). \tag{1.2}$$

It is called an orthonormal basis for $L^2(\mathbb{R}^d)$ if it is complete and the elements of the system (1.1) are mutually orthogonal in $L^2(\mathbb{R}^d)$ and have norm 1, or, equivalently, $\|g\| = 1$ and $A = B = 1$ in (1.2). One of the fundamental problems in Gabor analysis is to classify the windows $g$ and time-frequency sets $\Lambda$ with the property that the associated Gabor system $\mathcal{G}(g, \Lambda)$ forms a (Gabor) frame or an orthonormal basis for $L^2(\mathbb{R}^d)$. This is of course a very difficult problem and only partial results are known. For example, to the best of our knowledge, the complete characterization of time-frequency sets $\Lambda$ for which (1.1) is a frame for $L^2(\mathbb{R}^d)$ was only done when $g = e^{-\pi x^2}$, the Gaussian window. Lyubarskii, and Seip and Wallstén [15,17] showed that $\mathcal{G}(e^{-\pi x^2}, \Lambda)$ is a Gabor frame if and only if the lower Beurling density of $\Lambda$ is strictly greater than 1. If we assume that $\Lambda$ is a lattice of the form $a\mathbb{Z} \times b\mathbb{Z}$, then it is well known that $ab \le 1$ is a necessary condition for (1.1) to form a frame for $L^2(\mathbb{R}^d)$. Gröchenig and Stöckler [7] showed that for totally positive functions, (1.1) is a frame if and only if $ab < 1$. If we consider $g = \chi_{[0,c)}$, the characteristic function of an interval, the associated characterization problem is known as the abc-problem in Gabor analysis. By rescaling, one may assume that $c = 1$. In that case, the famous Janssen tie showed that the structure of the set of couples $(a, b)$ yielding a frame is very complicated [9,8]. A complete solution of the abc-problem was recently obtained by Dai and Sun [2].

In this paper, we focus our attention on Gabor systems of the form (1.1) which yield orthonormal bases for $L^2(\mathbb{R}^d)$. Perhaps the most natural and simplest example of a Gabor orthonormal basis is the system $\mathcal{G}(\chi_{[0,1]^d}, \mathbb{Z}^d \times \mathbb{Z}^d)$. The orthonormality property for this system easily follows from the facts that the Euclidean space can be partitioned by the $\mathbb{Z}^d$-translates of the hypercube $[0,1]^d$ and that the exponentials $e^{2\pi i \langle n, x \rangle}$, $n \in \mathbb{Z}^d$, form an orthonormal basis for the space of square-integrable functions supported on any of these translated hypercubes. A direct generalization of this observation is the following:

Proposition 1.1. Let $|g| = |K|^{-1/2} \chi_K$, where $|\cdot|$ denotes the Lebesgue measure, and $K \subset \mathbb{R}^d$ is measurable with finite Lebesgue measure. Suppose that:

• The translates of $K$ by the discrete set $J$ are pairwise a.e. disjoint and cover $\mathbb{R}^d$ up to a set of zero measure.

• For each $t \in J$, the set of exponentials $\{ e^{2\pi i \langle \lambda, x \rangle} : \lambda \in \Lambda_t \}$ is an orthonormal basis for $L^2(K)$.

Let

$$\Lambda = \bigcup_{t \in J} \{t\} \times \Lambda_t. \tag{1.3}$$

Then $\mathcal{G}(g, \Lambda)$ is a Gabor orthonormal basis for $L^2(\mathbb{R}^d)$.

Although its proof is straightforward and will be omitted (see also [14]), this proposition gives us a flexible way of constructing large families of Gabor orthonormal bases. The first condition above means that $K$ is a translational tile (with $J$ called an associated tiling set) and the second one that $L^2(K)$ admits an orthonormal basis of exponentials. If this last condition holds, $K$ is called a spectral set (and each $\Lambda_t$ is an associated spectrum). The connection between translational tiles and spectral sets is quite mysterious. They were in fact conjectured to be the same class of sets by Fuglede [3], but that statement was later disproved by Tao [18] and the exact relationship between the two classes remains unclear.
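The construction in Proposition 1.1 can be sanity-checked numerically in the simplest case $d = 1$, $K = [0,1]$, $J = \mathbb{Z}$ and $\Lambda_t = \mathbb{Z} + c_t$. The sketch below is only an illustration under these assumptions: the helper names and the frequency shifts `offsets` are hypothetical, and the inner products are approximated with a midpoint rule rather than computed exactly.

```python
import cmath

def gabor_inner(t1, lam1, t2, lam2, steps=4000):
    """Midpoint-rule approximation of the L^2 inner product of
    e^{2 pi i lam1 x} chi(x - t1) and e^{2 pi i lam2 x} chi(x - t2),
    where chi is the indicator function of [0, 1]."""
    lo, hi = max(t1, t2), min(t1, t2) + 1.0   # overlap of the two supports
    if hi <= lo:
        return 0j
    h = (hi - lo) / steps
    total = 0j
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += cmath.exp(2j * cmath.pi * (lam1 - lam2) * x)
    return total * h

def standard_patch(offsets, n_freq=3):
    """Finite patch of Lambda = union over t of {t} x (Z + c_t), as in (1.3)."""
    return [(t, k + c) for t, c in offsets.items()
            for k in range(-n_freq, n_freq + 1)]

# Hypothetical frequency shifts c_t for t in {-1, 0, 1}.
offsets = {-1: 0.25, 0: 0.0, 1: 0.7}
patch = standard_patch(offsets)

# Largest deviation of the Gram matrix of the patch from the identity.
worst = 0.0
for i, (t1, l1) in enumerate(patch):
    for j, (t2, l2) in enumerate(patch):
        target = 1.0 if i == j else 0.0
        worst = max(worst, abs(gabor_inner(t1, l1, t2, l2) - target))
print(worst)
```

The check only covers orthonormality on a finite patch; completeness, which is the other half of the proposition, is not something a finite computation of this kind can confirm.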

For the fixed window $g_d = \chi_{[0,1]^d}$, we call a countable set $\Lambda \subset \mathbb{R}^{2d}$ standard if it is of the form (1.3). Motivated by the complete solution to the abc-problem, our main objective in this paper is to characterize the discrete sets $\Lambda$ (not necessarily lattices) with the property that the Gabor system $\mathcal{G}(g_d, \Lambda)$ is a Gabor orthonormal basis. First, by generalizing the notion of orthogonal packing region (see Section 2) in the work of Lagarias, Reeds and Wang [12] to the setting of Gabor systems, we deduce a general criterion for $\mathcal{G}(g_d, \Lambda)$ to be a Gabor orthonormal basis.

Theorem 1.2. $\mathcal{G}(g_d, \Lambda)$ is a Gabor orthonormal basis if and only if $\mathcal{G}(g_d, \Lambda)$ is an orthogonal set and the translates of $[0,1]^{2d}$ by the elements of $\Lambda$ tile $\mathbb{R}^{2d}$.

This criterion offers a very simple solution to our problem in the one-dimensional case.

Theorem 1.3. In dimension $d = 1$, the system $\mathcal{G}(g_1, \Lambda)$ is a Gabor orthonormal basis if and only if $\Lambda$ is standard.

However, such a simple characterization ceases to exist in higher dimensions. We will introduce an inductive procedure which allows us to construct a Gabor orthonormal basis with window $g_d$ from a Gabor orthonormal basis with window $g_n$, $n < d$. This procedure can be used to produce many non-standard Gabor orthonormal bases, and we call a set $\Lambda$ obtained through this procedure pseudo-standard. Assuming a mild condition on a low-dimensional time-frequency space, we show that the sets $\Lambda$ for which $\mathcal{G}(g_d, \Lambda)$ is a Gabor orthonormal basis are essentially pseudo-standard (see Theorem 3.6).

Although we do not have a complete description of the sets $\Lambda$ yielding Gabor orthonormal bases with window $g_d$ in dimension $d \ge 3$, we managed to obtain a complete characterization of those discrete sets $\Lambda \subset \mathbb{R}^4$ such that $\mathcal{G}(g_2, \Lambda)$ forms an orthonormal basis for $L^2(\mathbb{R}^2)$.


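Theorem 1.4. $\mathcal{G}(\chi_{[0,1]^2}, \Lambda)$ is a Gabor orthonormal basis for $L^2(\mathbb{R}^2)$ if and only if we can partition $\mathbb{Z}$ into $J$ and $J'$ such that either

$$\Lambda = \bigcup_{n \in J} \{ (m + t_{n,k},\, n,\, j + \mu_{k,m,n},\, k + \nu_n) : m, j, k \in \mathbb{Z} \} \;\cup\; \bigcup_{m \in \mathbb{Z}} \bigcup_{n \in J'} \{ (m + t_n, n) \} \times \Lambda_{m,n}$$

or

$$\Lambda = \bigcup_{m \in J} \{ (m,\, n + t_{m,j},\, j + \nu_m,\, k + \mu_{j,m,n}) : n, j, k \in \mathbb{Z} \} \;\cup\; \bigcup_{n \in \mathbb{Z}} \bigcup_{m \in J'} \{ (m, n + t_m) \} \times \Lambda_{m,n},$$

where $\Lambda_{m,n} + [0,1]^2$ tile $\mathbb{R}^2$ and $t_{n,k}$, $\mu_{k,m,n}$ and $\nu_n$ are real numbers in $[0,1)$ as a function of $m$, $n$ or $k$. See Fig. 1.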

We organize the paper as follows. In Section 2, we introduce some preliminary notation and prove Theorem 1.2. In Section 3, we prove Theorem 1.3 and introduce the pseudo-standard time-frequency sets. In the last section, we focus on dimension 2 and prove Theorem 1.4.

Fig. 1. This figure illustrates the time-domain of $\Lambda$ in the first situation of Theorem 1.4. We basically partition $\mathbb{R}^2$ by horizontal strips. Some strips, like $\mathbb{R} \times [0,1]$ with $n = 0$, have overlapping structure. This corresponds to the first union of $\Lambda$. Some strips, like $\mathbb{R} \times [1,2]$ with $n = 1$, have tiling structures. This corresponds to the second union of $\Lambda$.

2. Preliminaries

In this section, we explore the relationship between Gabor orthonormal bases and tilings in the time-frequency space. This theory will be an extension of the spectral-tile duality in [12] to the setting of Gabor analysis. Denote by $|K|$ the Lebesgue measure of a set $K$. We say that a closed set $T$ is a region if $|\partial T| = 0$ and $\overline{T^\circ} = T$. A bounded region $T$ is called a translational tile if we can find a countable set $J$ such that

(1) $|(T + t) \cap (T + t')| = 0$, $t, t' \in J$, $t \ne t'$, and

(2) $\bigcup_{t \in J} (T + t) = \mathbb{R}^d$.

In that case, $J$ is called a tiling set for $T$ and $T + J$ a tiling of $\mathbb{R}^d$. We will say that $T + J$ is a packing of $\mathbb{R}^n$ if (1) above is satisfied. We can generalize the notion of tiling and packing to measures and functions. Given a positive Borel measure $\mu$ and $f \in L^1(\mathbb{R}^n)$ with $f \ge 0$, the convolution of $f$ and $\mu$ is defined to be

$$f * \mu(x) = \int f(x - y)\, d\mu(y), \quad x \in \mathbb{R}^n$$

(where a Borel measurable function is chosen in the equivalence class of $f$ to define the integral above). We say that $f + \mu$ is a tiling (resp. packing) of $\mathbb{R}^d$ if $f * \mu = 1$ (resp. $f * \mu \le 1$) almost everywhere with respect to the Lebesgue measure. It is clear that if $f = \chi_T$ and $\mu = \delta_J$, where $\delta_J = \sum_{t \in J} \delta_t$, then $f * \mu = 1$ is equivalent to $T + J$ being a tiling.

First, we start with the following theorem, which gives us a very useful criterion to decide if a packing is actually a tiling. In fact, special cases of this theorem were proved by many different authors in different settings (see e.g. [12, Theorem 3.1], [11, Lemma 3.1] and [13]), but the following version is the most general one as far as we know.

Theorem 2.1. Suppose that $F, G \in L^1(\mathbb{R}^n)$ are two functions with $F, G \ge 0$ and $\int_{\mathbb{R}^n} F(x)\, dx = \int_{\mathbb{R}^n} G(x)\, dx = 1$. Suppose that $\mu$ is a positive Borel measure on $\mathbb{R}^n$ such that

$$F * \mu \le 1 \quad \text{and} \quad G * \mu \le 1.$$

Then, $F * \mu = 1$ if and only if $G * \mu = 1$.


DISTRIBUTION  A:  Distribution  approved  for  public  release. 




J.-P.  Gabardo  et  al.  /  Journal  of  Functional  Analysis  269  (2015)  1515-1538 


Proof. By symmetry, it suffices to prove one side of the equivalence. Assuming that $F * \mu = 1$, we have

$$1 = F * \mu \;\Longrightarrow\; 1 = 1 * G = G * F * \mu = F * G * \mu.$$

Letting $H = G * \mu$ we have $0 \le H \le 1$ and $H * F = 1$. We now show that $H = 1$. Indeed, letting $A$ be the set $\{x \in \mathbb{R}^n : H(x) < 1\}$ and $B = \mathbb{R}^n \setminus A$, we have

$$(H * F)(x) = \int H(y) F(x - y)\, dy = \int_A H(y) F(x - y)\, dy + \int_B H(y) F(x - y)\, dy.$$

Now, if $|A| > 0$, we have

$$\int_{\mathbb{R}^n} \int_A F(x - y)\, dy\, dx = |A| > 0$$

and there exists thus a set $E$ with positive measure such that

$$\int_A F(x - y)\, dy > 0, \quad x \in E.$$

If $x \in E$, we have

$$\int_A H(y) F(x - y)\, dy + \int_B H(y) F(x - y)\, dy < \int_A F(x - y)\, dy + \int_B F(x - y)\, dy = (1 * F)(x) = 1.$$

This contradicts the fact that $H * F = 1$ almost everywhere. Hence, $|A| = 0$ and $H = 1$ follows. □

Let $f, g \in L^2(\mathbb{R}^d)$. We define the short time Fourier transform of $f$ with respect to the window $g$ to be

$$V_g f(t, \nu) = \int_{\mathbb{R}^d} f(x)\, \overline{g(x - t)}\, e^{-2\pi i \langle \nu, x \rangle}\, dx.$$

Let $\mathcal{G}(g, \Lambda)$ be a Gabor orthonormal basis. Since translating $\Lambda$ by an element of $\mathbb{R}^{2d}$ does not affect the orthonormality nor the completeness of the given system, there is no loss of generality in assuming that $(0, 0) \in \Lambda$. We say that a region $D$ ($\subset \mathbb{R}^{2d}$) is an orthogonal packing region for $g$ if

$$(D^\circ - D^\circ) \cap Z(V_g g) = \emptyset.$$

Here $Z(V_g g) = \{(t, \nu) : V_g g(t, \nu) = 0\}$.
Lemma 2.2. Suppose that $\mathcal{G}(g, \Lambda)$ is a mutually orthogonal set of $L^2(\mathbb{R}^d)$. Let $D$ be any orthogonal packing region for $g$. Then $\Lambda - \Lambda \subset Z(V_g g) \cup \{0\}$ and $\Lambda + D$ is a packing of $\mathbb{R}^{2d}$. Suppose furthermore that $\mathcal{G}(g, \Lambda)$ is a Gabor orthonormal basis. Then $|D| \le 1$.

Proof. Let $(t, \lambda), (t', \lambda') \in \Lambda$ be two distinct points in $\Lambda$. Then

$$\int g(x - t')\, \overline{g(x - t)}\, e^{-2\pi i \langle \lambda - \lambda', x \rangle}\, dx = 0,$$

or equivalently, after the change of variable $y = x - t'$,

$$\int g(x)\, \overline{g(x - (t - t'))}\, e^{-2\pi i \langle \lambda - \lambda', x \rangle}\, dx = 0.$$

Hence, $V_g g(t - t', \lambda - \lambda') = 0$ and $(t, \lambda) - (t', \lambda') \in Z(V_g g)$. This means that $(t, \lambda) - (t', \lambda') \notin D^\circ - D^\circ$. Therefore, the intersection of the sets $(t, \lambda) + D$ and $(t', \lambda') + D$ has zero Lebesgue measure.

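Suppose now that $\mathcal{G}(g, \Lambda)$ is a Gabor orthonormal basis. Denote by $R$ the diameter of $D$. By the packing property of $\Lambda + D$,

$$|D|\, \frac{\#(\Lambda \cap [-T, T]^{2d})}{(2T)^{2d}} = \frac{1}{(2T)^{2d}} \left| \bigcup_{\lambda \in \Lambda \cap [-T, T]^{2d}} (D + \lambda) \right| \le \frac{1}{(2T)^{2d}} \left| [-T - R, T + R]^{2d} \right| = \left( 1 + \frac{R}{T} \right)^{2d}.$$

Taking the limit $T \to \infty$ and using the fact that the Beurling density of $\Lambda$ is 1 [16], we have $|D| \le 1$. □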


We say that an orthogonal packing region $D$ for $g$ is tight if we have furthermore $|D| = 1$. We now apply Theorem 2.1 to the Gabor orthonormal basis problem.

Theorem 2.3. Suppose that $\mathcal{G}(g, \Lambda)$ is an orthonormal set in $L^2(\mathbb{R}^d)$ and that $D$ is a tight orthogonal packing region for $g$. Then $\mathcal{G}(g, \Lambda)$ is a Gabor orthonormal basis for $L^2(\mathbb{R}^d)$ if and only if $\Lambda + D$ is a tiling of $\mathbb{R}^{2d}$.

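Proof. Let $F = \chi_D$ and $G = |V_g f|^2 / \|f\|_2^2$. Then $\int_{\mathbb{R}^{2d}} F = 1$ and $\int_{\mathbb{R}^{2d}} G = \|g\|_2^2 = 1$. Now, as $D$ is an orthogonal packing region for $g$, we have in particular

$$\sum_{\lambda \in \Lambda} \chi_D(x - \lambda) \le 1.$$

This shows that

$$\delta_\Lambda * F = \delta_\Lambda * \chi_D \le 1.$$

Moreover, $\Lambda + D$ is a tiling of $\mathbb{R}^{2d}$ if and only if $\delta_\Lambda * \chi_D = 1$. On the other hand, $\mathcal{G}(g, \Lambda)$ being a mutually orthogonal set, Bessel's inequality yields

$$\sum_{(t, \lambda) \in \Lambda} \left| \int f(x)\, \overline{g(x - t)}\, e^{-2\pi i \langle \lambda, x \rangle}\, dx \right|^2 \le \|f\|^2, \quad f \in L^2(\mathbb{R}^d),$$

or, replacing $f$ by $f(x - \tau) e^{2\pi i \langle \nu, x \rangle}$ with $(\tau, \nu) \in \mathbb{R}^{2d}$,

$$\sum_{(t, \lambda) \in \Lambda} |V_g f(\tau - t, \nu - \lambda)|^2 \le \|f\|^2, \quad f \in L^2(\mathbb{R}^d).$$

Hence,

$$\delta_\Lambda * G = \delta_\Lambda * \frac{|V_g f|^2}{\|f\|_2^2} \le 1$$

with equality if and only if the Gabor orthonormal system is in fact a basis. The conclusion then follows from Theorem 2.1. □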

Proof of Theorem 1.2. Let $g_d = \chi_{[0,1]^d}$. Using Theorem 2.3, we just need to show that $[0,1]^{2d}$ is a tight orthogonal packing region for $g_d$.

We first consider the case $d = 1$. For $g_1 = \chi_{[0,1]}$, a direct computation shows that

$$V_{g_1} g_1(t, \nu) = \begin{cases} 0, & |t| \ge 1; \\ \frac{1}{2\pi i \nu} \left( e^{-2\pi i \nu t} - e^{-2\pi i \nu} \right), & 0 \le t < 1; \\ \frac{1}{2\pi i \nu} \left( 1 - e^{-2\pi i \nu (t + 1)} \right), & -1 < t < 0. \end{cases} \tag{2.1}$$

The zero set of $V_{g_1} g_1$ is therefore given by

$$Z(V_{g_1} g_1) = \{(t, \nu) : |t| \ge 1\} \cup \{(t, \nu) : \nu(1 - |t|) \in \mathbb{Z} \setminus \{0\}\}. \tag{2.2}$$

Hence, $(0,1)^2 - (0,1)^2 = (-1,1)^2$ does not intersect the zero set and therefore $[0,1]^2$ is a tight orthogonal packing region for $g_1$.
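The closed form (2.1) and the description (2.2) of its zero set are easy to cross-check numerically. The following is a minimal sketch, assuming the sign convention $V_g g(t,\nu) = \int g(x)\overline{g(x-t)}e^{-2\pi i \nu x}\,dx$; the function names are hypothetical and the midpoint-rule integrator is only a rough cross-check, not a proof.

```python
import cmath
import math

def stft_box(t, nu):
    """Closed form of V_g g(t, nu) for g = indicator of [0, 1], following (2.1)."""
    if abs(t) >= 1:
        return 0j
    if nu == 0:
        return complex(1 - abs(t))            # length of the overlap of supports
    c = 1 / (2j * math.pi * nu)
    if t >= 0:
        return c * (cmath.exp(-2j * math.pi * nu * t) - cmath.exp(-2j * math.pi * nu))
    return c * (1 - cmath.exp(-2j * math.pi * nu * (t + 1)))

def stft_box_numeric(t, nu, steps=20000):
    """Midpoint-rule evaluation of the defining integral over [0,1] intersected with [t,t+1]."""
    lo, hi = max(0.0, t), min(1.0, t + 1.0)
    if hi <= lo:
        return 0j
    h = (hi - lo) / steps
    return h * sum(cmath.exp(-2j * math.pi * nu * (lo + (i + 0.5) * h))
                   for i in range(steps))

def in_zero_set(t, nu, tol=1e-9):
    """Membership test for the set described in (2.2)."""
    if abs(t) >= 1:
        return True
    v = nu * (1 - abs(t))
    return abs(v - round(v)) < tol and round(v) != 0

print(abs(stft_box(0.5, 2.0)), in_zero_set(0.5, 2.0))
print(abs(stft_box(0.3, 0.7) - stft_box_numeric(0.3, 0.7)))
```

For $|t| < 1$ and $|\nu| < 1$ the product $\nu(1 - |t|)$ has modulus less than 1, so it can never be a non-zero integer; this is the reason the open square $(-1,1)^2$ avoids the zero set.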

We now consider the case $d \ge 2$. As we can decompose $g_d$ as $\chi_{[0,1]}(x_1) \cdots \chi_{[0,1]}(x_d)$, we have

$$V_{g_d} g_d(t, \nu) = V_{g_1} g_1(t_1, \nu_1) \cdots V_{g_1} g_1(t_d, \nu_d), \quad \text{where } t = (t_1, \ldots, t_d) \text{ and } \nu = (\nu_1, \ldots, \nu_d).$$

The zero set of $V_{g_d} g_d$ is therefore given by

$$Z(V_{g_d} g_d) = \{(t, \nu) : |t|_{\max} \ge 1\} \cup \left( \bigcup_{i=1}^{d} \{(t, \nu) : \nu_i (1 - |t_i|) \in \mathbb{Z} \setminus \{0\}\} \right),$$

where $|t|_{\max} = \max\{|t_1|, \ldots, |t_d|\}$. It follows that $[0,1]^{2d}$ is a tight orthogonal packing region for $g_d$. □

The  following  example  will  not  be  used  in  later  discussion,  but  it  demonstrates  the 
usefulness  of  the  theory  for  windows  other  than  the  unit  cube. 

Example  2.4.  Let  g(x)  =  e2x_^e-2x  be  the  hyperbolic  secant  function.  It  can  be  shown 
([10];  see  also  [4])  that 


7rsin(7rnt)e 

9 9  ’  sinh(2£)  smh{iv2v/2) 

and the zero set is given by

$$Z(V_g g) = \{(t, \nu) : t\nu \in \mathbb{Z} \setminus \{0\}\}.$$

Hence, $[0,1]^2$ is a tight orthogonal packing region for $g$. Note that the zero set does not contain any points on the $x$-axis and $y$-axis. There is no tiling set $\Lambda$ for $[0,1]^2$ such that $\Lambda - \Lambda \subset Z(V_g g) \cup \{0\}$ (see also Proposition 3.2 in the next section) and thus there is no Gabor orthonormal basis using the hyperbolic secant as a window. This can be viewed as a particular case of a version of the Balian-Low theorem valid for irregular Gabor frames which was recently obtained in [1] and which states that Gabor orthonormal bases cannot exist if the window function is in the modulation space $M^1(\mathbb{R}^d)$.

3.  Gabor  orthonormal  bases 

Using Lemma 2.2, Theorem 1.2 may be restated in the following way:

Theorem 3.1. $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$ is a Gabor orthonormal basis if and only if the inclusion $\Lambda - \Lambda \subset Z(V_g g) \cup \{0\}$ holds and $\Lambda + [0,1]^{2d}$ is a tiling.

In view of the previous result, the possible translational tilings of the unit cube on $\mathbb{R}^{2d}$ play a fundamental role in the solution of our problem. A characterization of these is not available in arbitrary dimension $2d$, but it is easily obtained when $d = 1$. We prove this result here for completeness but it should be well known.

Proposition 3.2. Suppose that $\chi_{[0,1]^2} + J$ is a tiling of $\mathbb{R}^2$ with $(0,0) \in J$. Then $J$ is of either of the following two forms:

$$J = \bigcup_{k \in \mathbb{Z}} (\mathbb{Z} + a_k) \times \{k\} \quad \text{or} \quad J = \bigcup_{k \in \mathbb{Z}} \{k\} \times (\mathbb{Z} + a_k) \tag{3.1}$$

where $a_k$ are any real numbers in $[0,1)$ for $k \ne 0$ and $a_0 = 0$.
Proof. By Keller's criterion for square tilings (see e.g. [12, Proposition 4.1]), for any distinct $(t_1, t_2)$ and $(t_1', t_2')$ in $J$, $t_i - t_i' \in \mathbb{Z} \setminus \{0\}$ for some $i = 1, 2$. Taking $(t_1', t_2') = (0,0)$, we obtain that, for any $(t_1, t_2) \in J \setminus \{(0,0)\}$, one of $t_1$ or $t_2$ belongs to $\mathbb{Z} \setminus \{0\}$. If $J \subset \mathbb{Z}^2$, we must have $J = \mathbb{Z}^2$ for $\chi_{[0,1]^2} + J$ to be a tiling of $\mathbb{R}^2$, and $\mathbb{Z}^2$ can be written as either of the sets in (3.1) by taking $a_k = 0$ for all $k$. Suppose that there exists $(s_1, s_2) \in J$ such that $s_1$ is not an integer and $s_2 \in \mathbb{Z}$. If $(t_1, t_2) \in J$ and $t_2 \notin \mathbb{Z}$, then both $t_1$ and $t_1 - s_1$ must be integers, which would imply that $s_1$ is an integer, contrary to our assumption. Hence, $(s_1, s_2) \in J$ implies $s_2 \in \mathbb{Z}$ and we can write

$$J = \bigcup_{k \in \mathbb{Z}} J_k \times \{k\},$$

for some discrete set $J_k \subset \mathbb{R}$. For $\chi_{[0,1]^2} + J$ to be a tiling of $\mathbb{R}^2$, the set $J_k$ must be of the form $J_k = \mathbb{Z} + a_k$. In that case $J$ can be expressed as one of the sets in the first collection appearing in (3.1).

Similarly, if there exists $(s_1, s_2) \in J$ such that $s_2$ is not an integer and $s_1 \in \mathbb{Z}$, $J$ can be expressed as one of the sets in the second collection appearing in (3.1). This completes the proof. □
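As a small illustration of Proposition 3.2, the sketch below checks on a finite patch that a set of the first form in (3.1) satisfies Keller's pairwise criterion and covers sample points exactly once. It is only an illustration under stated assumptions: the row shifts `shifts` are made up, half-open squares $[s, s+1) \times [k, k+1)$ are used to sidestep boundary points (which have measure zero), and exact rational arithmetic avoids rounding issues.

```python
import math
import random
from fractions import Fraction

def cover_count(x, y, shifts):
    """Number of half-open squares [s, s+1) x [k, k+1), with (s, k) in
    J = union over k of (Z + a_k) x {k}, containing the point (x, y).
    `shifts` maps a row index k to a_k; missing rows use a_k = 0."""
    k = math.floor(y)                       # only row k can contain y
    a = shifts.get(k, Fraction(0))
    base = math.floor(x - a) + a            # candidate corner in Z + a_k
    return sum(1 for s in (base - 1, base, base + 1) if s <= x < s + 1)

def keller_ok(points):
    """Keller's criterion: every pair of distinct points differs by a
    non-zero integer in at least one coordinate."""
    def nonzero_int(v):
        return v != 0 and Fraction(v).denominator == 1
    return all(nonzero_int(p[0] - q[0]) or nonzero_int(p[1] - q[1])
               for i, p in enumerate(points) for q in points[i + 1:])

# Hypothetical row shifts a_k in [0, 1), with a_0 = 0.
shifts = {-1: Fraction(1, 3), 0: Fraction(0), 1: Fraction(3, 5), 2: Fraction(1, 7)}
patch = [(m + shifts[k], Fraction(k)) for k in shifts for m in range(-3, 4)]

random.seed(0)
samples = [(Fraction(random.randint(-2000, 2000), 1000),
            Fraction(random.randint(-1000, 2999), 1000)) for _ in range(200)]
counts = {cover_count(x, y, shifts) for x, y in samples}
print(counts, keller_ok(patch))
```

Shifting a whole row horizontally never breaks the tiling, which is why the $a_k$ are free; shifting in both directions at once, as in the pair $(0,0)$, $(1/2, 1/2)$, violates Keller's criterion.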

We say that the Gabor orthonormal basis $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$ is standard if

$$\Lambda = \bigcup_{t \in J} \{t\} \times \Lambda_t,$$

where $J + [0,1]^d$ tiles $\mathbb{R}^d$ and $\Lambda_t$ is a spectrum for $[0,1]^d$. (Note that, by the result in [12], $\Lambda_t + [0,1]^d$ must then be a tiling of $\mathbb{R}^d$ for every $t \in J$.)

The following result settles the one-dimensional case.

Theorem 3.3. $\mathcal{G}(\chi_{[0,1]}, \Lambda)$ is a Gabor orthonormal basis if and only if $\Lambda$ is standard.

Proof. We just need to show that $\Lambda$ being standard is a necessary condition for $\mathcal{G}(\chi_{[0,1]}, \Lambda)$ to be a Gabor orthonormal basis. We can also assume, for simplicity, that $(0,0) \in \Lambda$. By Theorem 3.1, if $\mathcal{G}(\chi_{[0,1]}, \Lambda)$ is a Gabor orthonormal basis, then $\Lambda - \Lambda \subset Z(V_g g) \cup \{0\}$ and $\Lambda + [0,1]^2$ must be a tiling of $\mathbb{R}^2$. By Proposition 3.2, $\Lambda$ must be of either one of the forms in (3.1). Note that $\Lambda$ is standard in the second case. In order to deal with the first case, suppose that

$$\Lambda = \bigcup_{k \in \mathbb{Z}} (\mathbb{Z} + a_k) \times \{k\}, \quad \text{with } a_k \in [0,1),\ k \ne 0,\ a_0 = 0.$$

We now show that this is impossible unless $a_k = 0$ for all $k$ (which reduces to the case $\Lambda = \mathbb{Z}^2$, which is standard). We can assume, without loss of generality, that $a_k \ne 0$ for some $k > 0$, with $k$ being the smallest such index. Then both $(a_k, k)$ and $(0, k - 1)$ are in $\Lambda$. The orthogonality of the Gabor system then implies that $(a_k, 1) \in Z(V_g g)$. Using (2.2), we deduce that $1 \cdot (1 - |a_k|) \in \mathbb{Z} \setminus \{0\}$. That means $a_k$ must be an integer, which is a contradiction. Hence, the first case is impossible unless $a_k = 0$ for all $k$ and the proof is completed. □

A description of all time-frequency sets $\Lambda$ for which $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$ is a Gabor orthonormal basis, however, becomes vastly more complicated when $d \ge 2$. In particular, as we will see, the standard structure cannot cover all possible cases. Consider integers $m, n > 0$ such that $m + n = d$. For convenience and to be consistent with our previous notation, we will write the cartesian product of the two time-frequency spaces $\mathbb{R}^{2m}$ and $\mathbb{R}^{2n}$ in the non-standard form

$$\mathbb{R}^{2d} = \mathbb{R}^{2m} \times \mathbb{R}^{2n} = \{(s, t, \lambda, \nu) : (s, \lambda) \in \mathbb{R}^{2m},\ (t, \nu) \in \mathbb{R}^{2n}\}.$$

We will also denote by $\Pi_1$ the projection operator from $\mathbb{R}^{2d}$ to $\mathbb{R}^{2m}$ defined by

$$\Pi_1((s, t, \lambda, \nu)) = (s, \lambda), \quad (s, t, \lambda, \nu) \in \mathbb{R}^{2d} = \mathbb{R}^{2m} \times \mathbb{R}^{2n}. \tag{3.2}$$

To simplify the notation, we also define $g_k = \chi_{[0,1]^k}$ for any $k \ge 1$. We now build a new family of time-frequency sets on $\mathbb{R}^{2d}$ as follows. Suppose that $\mathcal{G}(\chi_{[0,1]^m}, \Lambda_1)$ is a Gabor orthonormal basis for $L^2(\mathbb{R}^m)$ and that we associate with each $(s, \lambda) \in \Lambda_1$ a discrete set $\Lambda_{(s,\lambda)}$ in $\mathbb{R}^{2n}$ such that $\mathcal{G}(\chi_{[0,1]^n}, \Lambda_{(s,\lambda)})$ is a Gabor orthonormal basis of $L^2(\mathbb{R}^n)$. We then define

$$\Lambda = \bigcup_{(s,\lambda) \in \Lambda_1} \{(s, t, \lambda, \nu) : (t, \nu) \in \Lambda_{(s,\lambda)}\}. \tag{3.3}$$

We say that a Gabor system $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$ with $\Lambda$ as in (3.3) is pseudo-standard.

Proposition 3.4. Every pseudo-standard Gabor system $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$ is a Gabor orthonormal basis of $L^2(\mathbb{R}^d)$.

Proof. If $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, we have $g_d(x, y) = g_m(x) g_n(y)$ (for $m + n = d$). This yields immediately that

$$V_{g_d} g_d(s, t, \lambda, \nu) = V_{g_m} g_m(s, \lambda)\, V_{g_n} g_n(t, \nu), \quad (s, \lambda) \in \mathbb{R}^{2m},\ (t, \nu) \in \mathbb{R}^{2n}. \tag{3.4}$$

Suppose that $\rho = (s, t, \lambda, \nu)$ and $\rho' = (s', t', \lambda', \nu')$ are distinct elements of $\Lambda$. If $(s, \lambda) = (s', \lambda')$, then $(t, \nu)$ and $(t', \nu')$ are distinct elements of $\Lambda_{(s,\lambda)}$ and we have thus

$$(t' - t, \nu' - \nu) \in Z(V_{g_n} g_n),$$

which implies that $V_{g_d} g_d(\rho' - \rho) = 0$. On the other hand, if $(s, \lambda) \ne (s', \lambda')$, we have then

$$(s' - s, \lambda' - \lambda) \in Z(V_{g_m} g_m),$$

which implies again that $V_{g_d} g_d(\rho' - \rho) = 0$. This proves the orthonormality of the system $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$. The proposition can now be proved by invoking Theorem 3.1 if we can show that $\Lambda + [0,1]^{2d}$ is a tiling of $\mathbb{R}^{2d}$. To prove this, we note that $\Lambda_1 + [0,1]^{2m}$ is a tiling of the subspace $\mathbb{R}^{2m}$ by Theorem 3.1 and that, similarly, for each $(s, \lambda) \in \Lambda_1$, $\Lambda_{(s,\lambda)} + [0,1]^{2n}$ is a tiling of $\mathbb{R}^{2n}$. This easily implies the required tiling property and concludes the proof. □

Example 3.5. Consider the two-dimensional case $d = 2$. Let

$$\Lambda_1 = \bigcup_{m \in \mathbb{Z}} \{m\} \times (\mathbb{Z} + \mu_m), \quad \mu_m \in [0,1).$$

Associate with each $\gamma = (m, j + \mu_m) \in \Lambda_1$ the set

$$\Lambda_\gamma = \bigcup_{n \in \mathbb{Z}} \{n + s_{m,j}\} \times (\mathbb{Z} + \nu_{n,m,j}), \quad s_{m,j},\ \nu_{n,m,j} \in [0,1).$$

Then,

$$\Lambda := \{(m,\, n + s_{m,j},\, j + \mu_m,\, k + \nu_{n,m,j}) : m, n, j, k \in \mathbb{Z}\}$$

(written in the form $(t_1, t_2, \lambda_1, \lambda_2)$, where $(t_1, t_2)$ are the translations and $(\lambda_1, \lambda_2)$ the frequencies) has the pseudo-standard structure. Note that the parameters $s_{m,j}$ can be chosen so that the set $\Lambda$ is not standard, as the set

$$\{(m, n + s_{m,j}) : m, n, j \in \mathbb{Z}\} + [0,1]^2$$

will not tile $\mathbb{R}^2$ in general. For example, for $m = n = 0$, we could let $s_{0,0} = 0$ and the numbers $s_{0,j}$ could be chosen as distinct numbers in the interval $[0,1)$. The square $[0,1]^2$ would then overlap with infinitely many of its translates appearing as part of the Gabor system.
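The pseudo-standard set of Example 3.5 can be explored on a finite patch. The sketch below is an illustration only: the parameter functions `mu`, `s` and `nu` are hypothetical choices with values in $[0,1)$, and orthogonality is tested through the modulus of the closed form (2.1) together with the factorization of $V_{g_2} g_2$ over coordinates. It reports the largest value of $|V_{g_2} g_2|$ over all differences of distinct points in the patch and whether some time translates of the unit square overlap.

```python
import math
from itertools import product

def box_stft_abs(t, nu):
    """|V_g g(t, nu)| for g = indicator of [0, 1] (modulus of the closed form (2.1))."""
    if abs(t) >= 1:
        return 0.0
    length = 1 - abs(t)
    if nu == 0:
        return length
    return abs(math.sin(math.pi * nu * length) / (math.pi * nu))

def square_stft_abs(dt, dlam):
    """|V_{g_2} g_2| factorizes into two one-dimensional factors."""
    return box_stft_abs(dt[0], dlam[0]) * box_stft_abs(dt[1], dlam[1])

def pseudo_standard_patch(mu, s, nu, size=1):
    """Points ((m, n + s(m, j)), (j + mu(m), k + nu(n, m, j))) for indices in [-size, size]."""
    r = range(-size, size + 1)
    return [((m, n + s(m, j)), (j + mu(m), k + nu(n, m, j)))
            for m, n, j, k in product(r, r, r, r)]

def has_overlapping_time_shifts(patch):
    """True if two distinct time shifts differ by less than 1 in both coordinates."""
    times = sorted({p[0] for p in patch})
    return any(abs(a[0] - b[0]) < 1 and abs(a[1] - b[1]) < 1
               for i, a in enumerate(times) for b in times[i + 1:])

# Hypothetical parameter choices in [0, 1), made up for this illustration.
mu = lambda m: (0.37 * m) % 1.0
s = lambda m, j: (0.21 * j + 0.13 * m) % 1.0
nu = lambda n, m, j: (0.11 * n + 0.29 * m + 0.43 * j) % 1.0

patch = pseudo_standard_patch(mu, s, nu)
worst = max(
    square_stft_abs((p[0][0] - q[0][0], p[0][1] - q[0][1]),
                    (p[1][0] - q[1][0], p[1][1] - q[1][1]))
    for p, q in product(patch, patch) if p is not q
)
print(worst, has_overlapping_time_shifts(patch))
```

With these parameters the time shifts $(0, 0)$ and $(0, 0.21)$ both occur, so the corresponding unit squares overlap, while every pairwise value of $|V_{g_2} g_2|$ vanishes up to rounding; this mirrors the non-standard behaviour described in the example.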

Applying a similar procedure in higher dimensions, we can produce many non-standard Gabor orthonormal bases with window $\chi_{[0,1]^d}$. However, the pseudo-standard structure still cannot cover all possible cases of time-frequency sets. A time-frequency set could be a mixture of pseudo-standard and standard structure. For example, consider the set

$$\Lambda = \bigcup_{n \in \mathbb{Z} \setminus \{1\}} \{(m + t_{n,k},\, n,\, j + \mu_{k,m,n},\, k + \nu_n) : m, j, k \in \mathbb{Z}\} \;\cup\; \bigcup_{m \in \mathbb{Z}} \{(m, 1)\} \times \Lambda_m,$$

where $\Lambda_m + [0,1]^2$ tiles $\mathbb{R}^2$. This set consists of two parts. The first part is a subset of a set having the pseudo-standard structure while the second part is a subset of a set having the standard one. Moreover, the translates of the unit square associated with the first part are disjoint from those associated with the second part, showing that $\mathcal{G}(\chi_{[0,1]^2}, \Lambda)$ is a mutually orthogonal set. Since $\Lambda + [0,1]^4$ is clearly a tiling of $\mathbb{R}^4$, Theorem 3.1 shows that $\mathcal{G}(\chi_{[0,1]^2}, \Lambda)$ is a Gabor orthonormal basis.

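In the next section, we will classify all possible sets $\Lambda \subset \mathbb{R}^4$ with the property that $\mathcal{G}(\chi_{[0,1]^2}, \Lambda)$ is a Gabor orthonormal basis for $L^2(\mathbb{R}^2)$. In general dimension, however, we have the following result.

Theorem 3.6. Let $d = m + n$ and let $\Pi_1 : \mathbb{R}^{2d} \to \mathbb{R}^{2m}$ be defined by (3.2). Suppose that $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$ is a Gabor orthonormal basis and that $\Pi_1(\Lambda) + [0,1]^{2m}$ tiles $\mathbb{R}^{2m}$. Then $\Lambda$ has the pseudo-standard structure.

Proposition 3.7. Let $d = m + n$ and suppose that $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$ is a Gabor orthonormal basis for $L^2(\mathbb{R}^d)$. If $(s_0, \lambda_0) \in \mathbb{R}^{2m}$, consider the translate of the unit hypercube in $\mathbb{R}^{2m}$, $C = (s_0, \lambda_0) + [0,1)^{2m}$, and define

$$\Lambda(C) := \{(t, \nu) \in \mathbb{R}^{2n} : (s, t, \lambda, \nu) \in \Lambda \text{ and } (s, \lambda) \in C\}.$$

Then $\mathcal{G}(\chi_{[0,1]^n}, \Lambda(C))$ is a Gabor orthonormal basis for $L^2(\mathbb{R}^n)$.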

Proof. We first show that the system $\mathcal{G}(\chi_{[0,1]^n}, \Lambda(C))$ is orthogonal. Let $(t, \nu)$ and $(t', \nu')$ be distinct elements of $\Lambda(C)$. There exist $(s, \lambda)$ and $(s', \lambda')$ in $C$ such that $(s, t, \lambda, \nu)$ and $(s', t', \lambda', \nu')$ both belong to $\Lambda$. Using the mutual orthogonality of the system $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$ together with (3.4), we have

$$V_{g_m} g_m(s - s', \lambda - \lambda') = 0 \quad \text{or} \quad V_{g_n} g_n(t - t', \nu - \nu') = 0.$$

Note that, as both $(s, \lambda)$ and $(s', \lambda')$ belong to $C$, we have $|s - s'|_{\max} < 1$ and $|\lambda - \lambda'|_{\max} < 1$. In particular, $V_{g_m} g_m(s - s', \lambda - \lambda') \ne 0$ and the orthogonality of the system $\mathcal{G}(\chi_{[0,1]^n}, \Lambda(C))$ follows.

If $(s, \lambda) \in \Pi_1(\Lambda)$ (as defined in (3.2)), let

$$\Lambda_{(s,\lambda)} = \{(t, \nu) : (s, t, \lambda, \nu) \in \Lambda\}.$$


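Let $f_1 \in L^2(\mathbb{R}^m)$, $f_2 \in L^2(\mathbb{R}^n)$ and $(s_0, \lambda_0) \in \mathbb{R}^{2m}$. Applying Parseval's identity to the function

$$f(x, y) = e^{2\pi i \langle \lambda_0, x \rangle} f_1(x - s_0)\, f_2(y), \quad x \in \mathbb{R}^m,\ y \in \mathbb{R}^n,$$

we obtain that

$$\int_{\mathbb{R}^m} |f_1(x)|^2\, dx \int_{\mathbb{R}^n} |f_2(y)|^2\, dy = \sum_{(s,\lambda) \in \Pi_1(\Lambda)} \sum_{(t,\nu) \in \Lambda_{(s,\lambda)}} |V_{g_m} f_1(s - s_0, \lambda - \lambda_0)|^2\, |V_{g_n} f_2(t, \nu)|^2$$

$$= \sum_{(s,\lambda) \in \Pi_1(\Lambda)} \sum_{(t,\nu) \in \Lambda_{(s,\lambda)}} |V_{f_1} g_m(s_0 - s, \lambda_0 - \lambda)|^2\, |V_{g_n} f_2(t, \nu)|^2.$$

Defining

$$w(s, \lambda) = \|f_2\|_2^{-2} \sum_{(t,\nu) \in \Lambda_{(s,\lambda)}} |V_{g_n} f_2(t, \nu)|^2 \quad \text{and} \quad \mu = \sum_{(s,\lambda) \in \Pi_1(\Lambda)} w(s, \lambda)\, \delta_{(s,\lambda)}$$

for $f_2 \ne 0$, the above identity can be written as

$$\int_{\mathbb{R}^m} |f_1(x)|^2\, dx = \sum_{(s,\lambda) \in \Pi_1(\Lambda)} w(s, \lambda)\, |V_{f_1} g_m(s_0 - s, \lambda_0 - \lambda)|^2 = \left( \mu * |V_{f_1} g_m|^2 \right)(s_0, \lambda_0).$$

On the other hand, letting $\tilde{\chi}_{[0,1)^{2m}}(s, \lambda) = \chi_{[0,1)^{2m}}(-s, -\lambda)$ and defining $C$ and $\Lambda(C)$ as above, we have also

$$\left( \mu * \tilde{\chi}_{[0,1)^{2m}} \right)(s_0, \lambda_0) = \sum_{(s,\lambda) \in \Pi_1(\Lambda)} w(s, \lambda)\, \chi_{[0,1)^{2m}}(s - s_0, \lambda - \lambda_0) = \sum_{(s,\lambda) \in \Pi_1(\Lambda) \cap C} w(s, \lambda)$$

$$= \|f_2\|_2^{-2} \sum_{(t,\nu) \in \Lambda(C)} |V_{g_n} f_2(t, \nu)|^2 \le 1,$$

where the last inequality results from the orthogonality of the system $\mathcal{G}(\chi_{[0,1]^n}, \Lambda(C))$ proved earlier. Since $(s_0, \lambda_0)$ is arbitrary in $\mathbb{R}^{2m}$ and

$$\int_{\mathbb{R}^{2m}} |V_{f_1} g_m(s, \lambda)|^2\, ds\, d\lambda = \|f_1\|_2^2,$$

Theorem 2.1 can be used to deduce that $\mu * \tilde{\chi}_{[0,1)^{2m}} = 1$. This shows that

$$\sum_{(t,\nu) \in \Lambda(C)} |V_{g_n} f_2(t, \nu)|^2 = \|f_2\|_2^2, \quad f_2 \in L^2(\mathbb{R}^n),$$

and thus that the system $\mathcal{G}(\chi_{[0,1]^n}, \Lambda(C))$ is complete, proving our claim. □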

Proof of Theorem 3.6. Let $J = \Pi_1(\Lambda)$ and, for any $(s, \lambda) \in J$, define

\[ \Lambda_{(s,\lambda)} = \{(t,\nu) : (s, t, \lambda, \nu) \in \Lambda\}. \]

If $(s_0, \lambda_0) \in J$, let $C = (s_0, \lambda_0) + [0,1)^{2m}$, and

\[ \Lambda(C) := \{(t,\nu) \in \mathbb{R}^{2n} : (s, t, \lambda, \nu) \in \Lambda \text{ and } (s,\lambda) \in C\}. \]

Proposition 3.7 shows that the system $\mathcal{G}(\chi_{[0,1]^n}, \Lambda(C))$ forms a Gabor orthonormal basis. By assumption $J + [0,1)^{2m}$ tiles $\mathbb{R}^{2m}$. Hence, $(s_0,\lambda_0) + [0,1)^{2m}$ contains exactly one point in $J$, i.e. $(s_0,\lambda_0)$, and we have

\[ \Lambda(C) = \{(t,\nu) : (s_0, t, \lambda_0, \nu) \in \Lambda\} = \Lambda_{(s_0,\lambda_0)}. \]

Therefore, we can write $\Lambda$ as

\[ \Lambda = \bigcup_{(s_0,\lambda_0)\in J} \{(s_0,\lambda_0)\} \times \Lambda_{(s_0,\lambda_0)}. \]

Our proof will be complete if we can show that $J$ is a Gabor orthonormal basis of $L^2(\mathbb{R}^m)$.

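As $J$ is a tiling set, by Proposition 3.1 it suffices to show that the inclusion $J - J \subset Z(V_{g_m} g_m) \cup \{0\}$ holds. Let $(s, \lambda)$ and $(s', \lambda')$ be distinct points in $J$. As $\Lambda_{(s,\lambda)} + [0,1)^{2n}$ tiles $\mathbb{R}^{2n}$, so does $\Lambda_{(s,\lambda)} + [-1,0)^{2n}$, and we can find $(t,\nu) \in \Lambda_{(s,\lambda)}$ such that $0 \in (t,\nu) + [-1,0)^{2n}$, or, equivalently, with $(t,\nu) \in [0,1)^{2n}$. Similarly, we can find $(t',\nu') \in \Lambda_{(s',\lambda')}$ such that $(t',\nu') \in [0,1)^{2n}$. Using the fact that $\mathcal{G}(\chi_{[0,1]^d}, \Lambda)$ is a Gabor orthonormal basis of $L^2(\mathbb{R}^d)$, we have

\[ (s, t, \lambda, \nu) - (s', t', \lambda', \nu') \in Z(V_{g_d} g_d), \]

or, equivalently,

\[ V_{g_m} g_m(s - s', \lambda - \lambda') = 0 \quad\text{or}\quad V_{g_n} g_n(t - t', \nu - \nu') = 0. \]

Note that, since $|t - t'|_{\max} < 1$ and $|\nu - \nu'|_{\max} < 1$, we have $V_{g_n} g_n(t - t', \nu - \nu') \neq 0$. Hence $(s,\lambda) - (s',\lambda') \in Z(V_{g_m} g_m)$ as claimed. □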

4.  Two-dimensional  Gabor  orthonormal  bases 

In this section, our goal will be to classify all possible Gabor orthonormal bases generated by the unit square on $\mathbb{R}^2$.

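Given a fixed Gabor orthonormal basis $\mathcal{G}(\chi_{[0,1]^2}, \Lambda)$ and a set $A \subset \mathbb{R}^2$, we define the sets

\[ \Gamma(A) = \{(\lambda_1,\lambda_2) \in \mathbb{R}^2 : (t_1, t_2, \lambda_1, \lambda_2) \in \Lambda,\ (t_1,t_2) \in A\}, \]

and, for any $(\lambda_1,\lambda_2) \in \mathbb{R}^2$ and any set $B \subset \mathbb{R}^2$, we let

\[ T_A(\lambda_1,\lambda_2) = \{(t_1,t_2) \in \mathbb{R}^2 : (t_1, t_2, \lambda_1, \lambda_2) \in \Lambda,\ (t_1,t_2) \in A\} \]

and

\[ T_A(B) = \{(t_1,t_2) \in \mathbb{R}^2 : (t_1, t_2, \lambda_1, \lambda_2) \in \Lambda,\ (t_1,t_2) \in A,\ (\lambda_1,\lambda_2) \in B\}. \]

In particular, the set $T_A(\Gamma(A))$ collects all the couples $(t_1,t_2) \in A$ such that $(t_1, t_2, \lambda_1, \lambda_2) \in \Lambda$ for some $(\lambda_1,\lambda_2) \in \mathbb{R}^2$.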

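We say that a square is half-open if it is a translate of one of the sets

\[ [0,1)^2, \quad (0,1]^2, \quad [0,1) \times (0,1] \quad\text{or}\quad (0,1] \times [0,1). \]

Two measurable subsets of $\mathbb{R}^d$ will be called essentially disjoint if their intersection has zero Lebesgue measure. In the derivation below, we will make use of the identity

\[ V_{g_2} g_2(t_1, t_2, \lambda_1, \lambda_2) = V_{g_1} g_1(t_1, \lambda_1)\, V_{g_1} g_1(t_2, \lambda_2), \quad (t_1, t_2, \lambda_1, \lambda_2) \in \mathbb{R}^4, \]

which implies, in particular, that

\[ V_{g_2} g_2(t_1, t_2, \lambda_1, \lambda_2) = 0 \iff V_{g_1} g_1(t_1, \lambda_1) = 0 \ \text{ or } \ V_{g_1} g_1(t_2, \lambda_2) = 0. \]

Moreover, using (2.3), the zero set of $V_{g_2} g_2$ is given by

\[ Z(V_{g_2} g_2) = \{(t,\lambda) : |t|_{\max} \ge 1\} \cup \left( \bigcup_{i=1}^{2} \{(t,\lambda) : \lambda_i (1 - |t_i|) \in \mathbb{Z} \setminus \{0\}\} \right). \tag{4.1} \]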

This implies that if $|t|_{\max} < 1$ and $(t,\lambda) \in Z(V_{g_2} g_2)$, then there exist $i \in \{1,2\}$ and some integer $m \neq 0$ such that

\[ |\lambda_i| = \frac{|m|}{1 - |t_i|} \ge 1, \]

with a strict inequality if $t_i \neq 0$. These properties will be used throughout this section.
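The zero-set description (4.1) can be checked numerically. The following is a minimal sketch, assuming the convention $V_g g(t,\lambda) = \int g(x)\,g(x-t)\,e^{-2\pi i \lambda x}\,dx$ for $g = \chi_{[0,1]}$ (the paper's own normalization in (2.3) is not shown here and may differ by a phase, which does not affect the zero set). The function names `stft_box`, `stft_square` and `in_zero_set` are illustrative and not taken from the paper.

```python
import numpy as np


def stft_box(t, lam):
    """V_g g(t, lam) for g = indicator of [0, 1] (assumed convention, see lead-in)."""
    # g(x) g(x - t) is the indicator of [max(0, t), min(1, 1 + t)], of length 1 - |t|.
    lo, hi = max(0.0, t), min(1.0, 1.0 + t)
    if hi <= lo:
        return 0j
    if lam == 0:
        return complex(hi - lo)
    # Closed form of the integral of exp(-2 pi i lam x) over [lo, hi].
    return (np.exp(-2j * np.pi * lam * lo) - np.exp(-2j * np.pi * lam * hi)) / (2j * np.pi * lam)


def stft_square(t1, t2, lam1, lam2):
    """Tensor-product identity: V_{g_2} g_2 factors into two one-dimensional terms."""
    return stft_box(t1, lam1) * stft_box(t2, lam2)


def in_zero_set(t, lam, tol=1e-9):
    """Membership test for the set on the right-hand side of (4.1)."""
    if max(abs(t[0]), abs(t[1])) >= 1:
        return True
    for ti, li in zip(t, lam):
        v = li * (1 - abs(ti))
        # v must be a nonzero integer
        if abs(v) > tol and abs(v - round(v)) < tol:
            return True
    return False


# lam_1 (1 - |t_1|) = (4/3)(3/4) = 1, so this point lies in the zero set.
print(abs(stft_square(0.25, 0.0, 4 / 3, 0.5)), in_zero_set((0.25, 0.0), (4 / 3, 0.5)))
```

For $|t_i| < 1$ the modulus of each factor is $|\sin(\pi \lambda_i (1 - |t_i|))| / (\pi |\lambda_i|)$, which is why the zeros sit exactly where $\lambda_i(1-|t_i|)$ is a nonzero integer.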

Lemma 4.1. Let $\mathcal{G}(\chi_{[0,1]^2}, \Lambda)$ be a Gabor orthonormal basis for $L^2(\mathbb{R}^2)$ and let $C$ be a half-open square. Then,

(i) $\Gamma(C) + [0,1]^2$ is a packing of $\mathbb{R}^2$.

(ii) If $(\lambda_1,\lambda_2) \in \Gamma(C)$, then $T_C(\lambda_1,\lambda_2)$ consists of one point.


Proof. (i) Let $(\lambda_1,\lambda_2)$ and $(\lambda_1',\lambda_2')$ be distinct elements of $\Gamma(C)$. By definition, we can find $(t_1,t_2)$ and $(t_1',t_2')$ in $C$ such that $(t_1,t_2,\lambda_1,\lambda_2), (t_1',t_2',\lambda_1',\lambda_2') \in \Lambda$. We then have

\[ 0 = V_{g_1} g_1(t_1 - t_1', \lambda_1 - \lambda_1')\, V_{g_1} g_1(t_2 - t_2', \lambda_2 - \lambda_2'). \]

If, without loss of generality, the first factor on the right-hand side of the previous equality vanishes, the fact that $|t_1 - t_1'| < 1$ shows the existence of an integer $k > 0$ such that

\[ |\lambda_1 - \lambda_1'| = \frac{k}{1 - |t_1 - t_1'|} \ge 1. \]

Hence, the cubes $(\lambda_1,\lambda_2) + [0,1]^2$ and $(\lambda_1',\lambda_2') + [0,1]^2$ are essentially disjoint.

(ii) Suppose that $T_C(\lambda_1,\lambda_2)$ contains two distinct points $(t_1,t_2)$ and $(t_1',t_2')$. Then,

\[ 0 = V_{g_1} g_1(t_1 - t_1', 0)\, V_{g_1} g_1(t_2 - t_2', 0). \]

As $V_{g_1} g_1(t, 0) \neq 0$ for any $t$ with $|t| < 1$, we must have $|t_1 - t_1'| \ge 1$ or $|t_2 - t_2'| \ge 1$, contradicting the fact that both $(t_1,t_2)$ and $(t_1',t_2')$ belong to $C$. □

In the following, we will denote by $\partial A$ the boundary of a set $A$. The next result will be useful.


Lemma 4.2. Under the hypotheses of the previous lemma, consider an element $\lambda = (\lambda_1,\lambda_2)$ of $\Gamma(C)$ and let $T_C(\lambda) = \{(t_1,t_2)\}$. Then for any $x \in \partial(\lambda + [0,1]^2)$, we can find $\lambda_x = (\lambda_{1,x}, \lambda_{2,x}) \in \Gamma(C)$ such that $x \in \partial(\lambda_x + [0,1]^2)$. Moreover, for any such $\lambda_x$, letting $T_C(\lambda_x) = \{t_x\}$, where $t_x = (t_{1,x}, t_{2,x})$, we can find $i_0 \in \{1,2\}$ such that $t_{i_0,x} = t_{i_0}$ and $\lambda_{i_0,x} = \lambda_{i_0} + 1$ or $\lambda_{i_0} - 1$.


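Proof. We can write $x = (\lambda_1 + \epsilon_1, \lambda_2 + \epsilon_2)$, where $0 \le \epsilon_i \le 1$, $i = 1,2$, and $\epsilon_i \in \{0,1\}$ for at least one index $i$. Let $a = (a_1,a_2) \in \mathbb{R}^2$ with $0 < a_i < 1$ for $i = 1,2$ and consider the point $(t_a, x) = (t_1 + a_1, t_2 + a_2, \lambda_1 + \epsilon_1, \lambda_2 + \epsilon_2)$ in $\mathbb{R}^4$. Since $\Lambda + [0,1]^4$ is a tiling on $\mathbb{R}^4$ and the point $(t_a, x)$ is a point on the boundary of $(t,\lambda) + [0,1]^4$, we can find some point $(t_{x,a}, \lambda_{x,a}) \in \Lambda \setminus \{(t,\lambda)\}$ such that $(t_a, x) \in (t_{x,a}, \lambda_{x,a}) + [0,1]^4$. Let $t_{x,a} = (t_1', t_2')$ and $\lambda_{x,a} = (\lambda_1', \lambda_2')$. We have

\[
\begin{cases}
t_i + a_i - 1 \le t_i' \le t_i + a_i, \\
-\epsilon_i \le \lambda_i - \lambda_i' \le 1 - \epsilon_i,
\end{cases}
\qquad i = 1,2. \tag{4.2}
\]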


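Using the orthogonality of the system $\mathcal{G}(\chi_{[0,1]^2}, \Lambda)$, we can find $i_0 \in \{1,2\}$ such that $V_{g_1} g_1(t_{i_0} - t_{i_0}', \lambda_{i_0} - \lambda_{i_0}') = 0$. Note that $t_{i_0} - t_{i_0}' \neq 0$ would imply that $|\lambda_{i_0} - \lambda_{i_0}'| > 1$, which is impossible from (4.2). Hence, $t_{i_0} = t_{i_0}'$ and $\lambda_{i_0} - \lambda_{i_0}' \neq 0$.

Moreover, as $V_{g_1} g_1(0, \nu) \neq 0$ if $|\nu| < 1$, $V_{g_1} g_1(t_{i_0} - t_{i_0}', \lambda_{i_0} - \lambda_{i_0}') = 0$ can only occur if $|\lambda_{i_0} - \lambda_{i_0}'| = 1$. This shows also that $\epsilon_{i_0} \in \{0,1\}$ in that case. This proves the last statement of our claim and the fact that $x \in \partial(\lambda_{x,a} + [0,1]^2)$. The proof will be complete if we can show that $\lambda_{x,a} \in \Gamma(C)$ for some choice of $a$.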

For simplicity, we consider the half-open square to be $C = [b_1, b_1 + 1) \times [b_2, b_2 + 1)$. Our assertion will be true if the point $t_{x,a} = (t_1', t_2')$ constructed above satisfies the inequalities $b_i \le t_i' < b_i + 1$ for $i = 1,2$. As $t_{i_0} = t_{i_0}'$, the inequalities clearly hold for $i = i_0$. Suppose that the other index $j$ falls out of the range, say $t_j' < b_j$ (the case $t_j' \ge b_j + 1$ is similar). We consider $(t_{a'}, x)$ with $a_j' = t_j' + 1 - t_j + \delta$ for some small $\delta > 0$. Note that, by (4.2), we have $t_i + a_i - 1 \le t_i' \le t_i + a_i$ for $i = 1,2$, and, in particular,

\[ a_j' = t_j' - t_j + 1 + \delta > 0. \]

We have also $a_j' < 1$. Indeed, the inequality $t_j' - t_j + 1 + \delta \ge 1$ would imply that $t_j' + 1 + \delta \ge 1 + t_j$. This is not possible, as $b_j \le t_j < b_j + 1$, so $1 + t_j \ge b_j + 1$. But $t_j' < b_j$, so $t_j' + 1 < b_j + 1$, so for $\delta$ small,

\[ t_j' + 1 + \delta < b_j + 1 \le 1 + t_j, \]

which yields a contradiction.

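Using the previous argument with $a'$ replacing $a$, we guarantee the existence of $t''$ such that $t_j' + \delta = t_j + a_j' - 1 \le t_j'' \le t_j + a_j' = t_j' + 1 + \delta$ and the associated point $(t_{x,a'}, \lambda_{x,a'}) = (t_1'', t_2'', \lambda_1'', \lambda_2'')$ in $\Lambda$ with the property that $x \in \partial(\lambda_{x,a'} + [0,1]^2)$ for some index $i_0'$ such that $|\lambda_{i_0'} - \lambda_{i_0'}''| = 1$, $t_{i_0'} = t_{i_0'}''$, and $\epsilon_{i_0'} \in \{0,1\}$. We claim that $t_j'' = t_j' + 1$. Now, $(t_1', t_2', \lambda_1', \lambda_2')$ and $(t_1'', t_2'', \lambda_1'', \lambda_2'')$ are in $\Lambda$. The mutual orthogonality property implies that $V_{g_1} g_1(t_i' - t_i'', \lambda_i' - \lambda_i'') = 0$ for some $i = 1,2$.

Suppose that $x$ is not one of the corner points of $\lambda + [0,1]^2$. In that case, the index $i$ such that $\epsilon_i \in \{0,1\}$ is unique and it follows that $i_0 = i_0'$. This implies, in particular, that $t_{i_0}' = t_{i_0}''$ (as $t_{i_0}' = t_{i_0} = t_{i_0'} = t_{i_0'}'' = t_{i_0}''$). Furthermore, the second set of inequalities in (4.2) shows that $\lambda_{i_0}' = \lambda_{i_0}'' = \lambda_{i_0} - 1$ if $\epsilon_{i_0} = 0$ and $\lambda_{i_0}' = \lambda_{i_0}'' = \lambda_{i_0} + 1$ if $\epsilon_{i_0} = 1$. We have thus $\lambda_{i_0}' = \lambda_{i_0}''$ in both cases, and therefore

\[ V_{g_1} g_1(t_{i_0}' - t_{i_0}'', \lambda_{i_0}' - \lambda_{i_0}'') = V_{g_1} g_1(0, 0) = 1. \]

Therefore, the other index $j$ must satisfy $V_{g_1} g_1(t_j' - t_j'', \lambda_j' - \lambda_j'') = 0$. The inequalities

\[ -\epsilon_j \le \lambda_j - \lambda_j' \le 1 - \epsilon_j \quad\text{and}\quad -\epsilon_j \le \lambda_j - \lambda_j'' \le 1 - \epsilon_j \]

yield $-1 \le \lambda_j' - \lambda_j'' \le 1$. However, $\delta \le t_j'' - t_j' \le 1 + \delta$. Then $V_{g_1} g_1(t_j' - t_j'', \lambda_j' - \lambda_j'')$ would not be zero unless $t_j'' \ge t_j' + 1$ ($\ge b_j$). Hence, $t_j' + 1 \le t_j'' \le t_j' + 1 + \delta$. This forces $t_j'' = t_j' + 1$. This completes the proof for non-corner points. If $x$ is one of the corner points, the square constructed for the non-corner points will certainly cover the corner point. Therefore, the proof is completed. □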

With the help of the previous two lemmas, the following tiling result for $\Gamma(C)$ follows immediately.

Corollary 4.3. Let $C$ be a half-open square. Then $\Gamma(C) + [0,1]^2$ is a tiling of $\mathbb{R}^2$.

Proof. It suffices to prove the following statement: suppose that $J + [0,1]^2$ is a non-empty packing of $\mathbb{R}^2$. If, for any $x \in \partial(t + [0,1]^2)$ where $t \in J$, we can find $t_x \in J$ with $t_x \neq t$ such that $x \in \partial(t_x + [0,1]^2)$, then $J + [0,1]^2$ is a tiling of $\mathbb{R}^2$. Indeed, by Lemma 4.1 (i) and Lemma 4.2, $\Gamma(C) + [0,1]^2$ is a packing of $\mathbb{R}^2$ and satisfies the stated property. It is thus a tiling of $\mathbb{R}^2$.

To prove the previous statement, we note that as $J + [0,1]^2$ is a packing, it is a closed set. Suppose that $J + [0,1]^2$ satisfies the property above and that $\mathbb{R}^2 \setminus (J + [0,1]^2) \neq \emptyset$. Let $x \in \partial(J + [0,1]^2)$ and assume that $x \in t + [0,1]^2$. We can then find $t_x \in J$ with $t_x \neq t$ such that $x \in \partial(t_x + [0,1]^2)$. Note that if $x$ were not a corner point of either $t + [0,1]^2$ or $t_x + [0,1]^2$, then $x$ would be in the interior of $J + [0,1]^2$. Hence, $x$ must be a corner point of $t + [0,1]^2$ or $t_x + [0,1]^2$. As the set of all the corner points of the squares in $J + [0,1]^2$ is countable, the Lebesgue measure of the open set $\mathbb{R}^2 \setminus (J + [0,1]^2)$ is zero and $\mathbb{R}^2 \setminus (J + [0,1]^2)$ is thus empty, proving our claim. □

Lemma 4.4. Let $C$ be a half-open square and suppose that $(\lambda_1,\lambda_2) \in \Gamma(C)$ with $T_C(\lambda_1,\lambda_2) = \{(t_1,t_2)\}$. Then all the sets $T_C(\lambda_1',\lambda_2')$ with $(\lambda_1',\lambda_2') \in \Gamma(C)$ are either of the form $\{(t_1, t_2 + s)\}$ or $\{(t_1 + s, t_2)\}$ for some real $s$ with $|s| < 1$ depending on $(\lambda_1',\lambda_2')$.


Proof. We first make the following remark. If $(\alpha_1,\alpha_2), (\beta_1,\beta_2) \in \Gamma(C)$ are such that the two squares $(\alpha_1,\alpha_2) + [0,1]^2$ and $(\beta_1,\beta_2) + [0,1]^2$ intersect each other and also both intersect a third square $(\gamma_1,\gamma_2) + [0,1]^2$ with $(\gamma_1,\gamma_2) \in \Gamma(C)$, then, letting $T_C(\gamma_1,\gamma_2) = \{(r_1,r_2)\}$, we have

\[ T_C(\alpha_1,\alpha_2) = \{(r_1 + a, r_2)\} \quad\text{and}\quad T_C(\beta_1,\beta_2) = \{(r_1 + b, r_2)\} \]

or

\[ T_C(\alpha_1,\alpha_2) = \{(r_1, r_2 + a)\} \quad\text{and}\quad T_C(\beta_1,\beta_2) = \{(r_1, r_2 + b)\}, \]

for some real $a, b$. Indeed, using Lemma 4.2, we have $T_C(\alpha_1,\alpha_2) = \{(r_1 + a, r_2)\}$ or $\{(r_1, r_2 + a)\}$ and $T_C(\beta_1,\beta_2) = \{(r_1 + b, r_2)\}$ or $\{(r_1, r_2 + b)\}$. Suppose, for example, that $T_C(\alpha_1,\alpha_2) = \{(r_1 + a, r_2)\}$ and $T_C(\beta_1,\beta_2) = \{(r_1, r_2 + b)\}$. Since the two squares intersect each other, we must have $|\alpha_1 - \beta_1| \le 1$ and $|\alpha_2 - \beta_2| \le 1$. The orthogonality property also implies that either $(a, \alpha_1 - \beta_1)$ or $(-b, \alpha_2 - \beta_2)$ is in the zero set of $V_{g_1} g_1$. But since we have $|a|, |b| < 1$, this would imply that $|\alpha_1 - \beta_1| > 1$ or $|\alpha_2 - \beta_2| > 1$, which cannot happen.

As $\Gamma(C) + [0,1]^2$ is a tiling of $\mathbb{R}^2$, for any square $(\sigma_1,\sigma_2) + [0,1]^2$ intersecting the square $(\lambda_1,\lambda_2) + [0,1]^2$ and with $(\sigma_1,\sigma_2) \in \Gamma(C)$, we can find another square $(\delta_1,\delta_2) + [0,1]^2$, with $(\delta_1,\delta_2) \in \Gamma(C)$ and with $(\delta_1,\delta_2) + [0,1]^2$ intersecting both squares $(\sigma_1,\sigma_2) + [0,1]^2$ and $(\lambda_1,\lambda_2) + [0,1]^2$. By the previous remark, the conclusion of the lemma holds for all the squares that neighbor the square $(\lambda_1,\lambda_2) + [0,1]^2$. Replacing this original square by one of the neighboring squares and continuing this process, we obtain the conclusion of the lemma for all the squares in the tiling $\Gamma(C) + [0,1]^2$ by an induction argument. This proves our claim. □

Suppose that the system $\mathcal{G}(\chi_{[0,1]^2}, \Lambda)$ gives rise to a non-standard Gabor orthonormal basis of $L^2(\mathbb{R}^2)$. Then, some of the squares will have overlaps and, without loss of generality, we can assume that

\[ \left| [0,1]^2 \cap \left([0,1]^2 + (t_1,t_2)\right) \right| > 0 \]

for some $(t_1,t_2)$ in the translation component of $\Lambda$.

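Lemma 4.5. If $(0,0,0,0) \in \Lambda$, then the sets $T_{[0,1)^2}(\lambda_1,\lambda_2)$, where $(\lambda_1,\lambda_2) \in \Gamma([0,1)^2)$, are either all of the form $\{(t,0)\}$ or all of the form $\{(0,t)\}$ with some $t$ (depending on $(\lambda_1,\lambda_2)$) with $|t| < 1$. In the first case, if there exists some $(\lambda_1,\lambda_2) \in \Gamma([0,1)^2)$ with $T_{[0,1)^2}(\lambda_1,\lambda_2) = \{(t,0)\}$ and $t \neq 0$, then

\[ \Gamma([0,1)^2) = \bigcup_{k\in\mathbb{Z}} (\mathbb{Z} + \mu_{k,0}) \times \{k\} \tag{4.3} \]

for some $0 \le \mu_{k,0} < 1$. Moreover, we can find $0 \le t_k < 1$ such that

\[ T_{[0,1)^2}\left((\mathbb{Z} + \mu_{k,0}) \times \{k\}\right) = \{(t_k, 0)\}, \quad k \in \mathbb{Z}, \tag{4.4} \]

and

\[ \Lambda \cap \left([0,1)^2 \times \mathbb{R}^2\right) = \{(t_k, 0, j + \mu_{k,0}, k) : j, k \in \mathbb{Z}\}. \tag{4.5} \]

(In the second case, $\Gamma([0,1)^2) = \bigcup_{k\in\mathbb{Z}} \{k\} \times (\mathbb{Z} + \mu_{k,0})$ and $T_{[0,1)^2}(\{k\} \times (\mathbb{Z} + \mu_{k,0})) = \{(0, t_k)\}$, $\Lambda \cap ([0,1)^2 \times \mathbb{R}^2) = \{(0, t_k, k, j + \mu_{k,0}) : j, k \in \mathbb{Z}\}$.)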
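The structure in (4.5) can be sanity-checked numerically: two time-frequency shifts of $\chi_{[0,1]^2}$ are orthogonal exactly when $V_{g_2} g_2$ vanishes at the difference of their parameters. The sketch below, under the assumed convention $V_g g(t,\lambda) = \int g(x)\,g(x-t)\,e^{-2\pi i\lambda x}\,dx$, builds a finite sample of points $(t_k, 0, j + \mu_{k,0}, k)$ and tests pairwise orthogonality. The particular values of $t_k$ and $\mu_{k,0}$ and all function names are hypothetical choices made for illustration only.

```python
import itertools
import numpy as np


def stft_box(t, lam):
    """V_g g(t, lam) for g = indicator of [0, 1] (assumed convention, see lead-in)."""
    lo, hi = max(0.0, t), min(1.0, 1.0 + t)
    if hi <= lo:
        return 0j
    if lam == 0:
        return complex(hi - lo)
    return (np.exp(-2j * np.pi * lam * lo) - np.exp(-2j * np.pi * lam * hi)) / (2j * np.pi * lam)


def orthogonal(p, q, tol=1e-9):
    """True if the shifts indexed by p and q (each (t1, t2, lam1, lam2)) are orthogonal."""
    t1, t2, l1, l2 = (a - b for a, b in zip(p, q))
    return abs(stft_box(t1, l1) * stft_box(t2, l2)) < tol


def sample_points(t, mu, js):
    """Finite sample of the set in (4.5): points (t_k, 0, j + mu_k, k)."""
    return [(t[k], 0.0, j + mu[k], float(k)) for k in t for j in js]


# Hypothetical parameters: t_0 = mu_0 = 0 so that (0, 0, 0, 0) is in the sample.
t = {-1: 0.3, 0: 0.0, 1: 0.7}
mu = {-1: 0.5, 0: 0.0, 1: 0.25}
points = sample_points(t, mu, range(-2, 3))
all_orth = all(orthogonal(p, q) for p, q in itertools.combinations(points, 2))
print(all_orth)
```

The check succeeds for any choice of $t_k, \mu_{k,0} \in [0,1)$: points with different $k$ differ by a nonzero integer in the last coordinate while sharing $t_2 = 0$, and points with the same $k$ differ by a nonzero integer in the third coordinate while sharing $t_1 = t_k$.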

Proof. If $\lambda = (0,0)$, we have $T_{[0,1)^2}(\lambda) = \{(0,0)\}$ as $(0,0,0,0) \in \Lambda$. By Lemma 4.4, any $(\lambda_1,\lambda_2) \in \Gamma([0,1)^2)$ with the square $(\lambda_1,\lambda_2) + [0,1]^2$ intersecting $[0,1]^2$ on the $\lambda_1,\lambda_2$-plane satisfies $T_{[0,1)^2}(\lambda_1,\lambda_2) = \{(t,0)\}$ or $T_{[0,1)^2}(\lambda_1,\lambda_2) = \{(0,t)\}$ with $|t| < 1$. Without loss of generality, we assume that the first case holds. As $\Gamma([0,1)^2) + [0,1]^2$ is a tiling of $\mathbb{R}^2$, for any square $C = (\lambda_1,\lambda_2) + [0,1]^2$, with $(\lambda_1,\lambda_2) \in \Gamma([0,1)^2)$, we can find squares $C_i = (\lambda_{1,i},\lambda_{2,i}) + [0,1]^2$ for $i = 0, \ldots, k$ with $(\lambda_{1,i},\lambda_{2,i}) \in \Gamma([0,1)^2)$ and such that $C_0 = [0,1]^2$, $C_k = C$, and with $C_i$ and $C_{i+1}$ touching each other for all $i = 0, \ldots, k-1$.

We have $T_{[0,1)^2}(\lambda_{1,1},\lambda_{2,1}) = \{(t_1, 0)\}$ for some number $t_1$ with $|t_1| < 1$. Since $C_2$ and $C_0$ both intersect $C_1$, $T_{[0,1)^2}(\lambda_{1,2},\lambda_{2,2}) = \{(t_2, 0)\}$ by Lemma 4.4 again. Inductively, we have $T_{[0,1)^2}(\lambda_{1,i},\lambda_{2,i}) = \{(t_i, 0)\}$, $i = 1, \ldots, k$, which proves the first part.

Consider the case where, for any $(\lambda_1,\lambda_2) \in \Gamma([0,1)^2)$, there exists a number $t = t(\lambda_1,\lambda_2)$ such that $T_{[0,1)^2}(\lambda_1,\lambda_2) = \{(t,0)\}$ and assume that $t(\lambda_1,\lambda_2) \neq 0$ for at least one couple $(\lambda_1,\lambda_2) \in \Gamma([0,1)^2)$. Suppose that $\Gamma([0,1)^2)$ is not of the form in (4.3). By Corollary 4.3 and Proposition 3.2, we must have $\Gamma([0,1)^2) = \bigcup_{k\in\mathbb{Z}} \{k\} \times (\mathbb{Z} + a_k)$ with $0 \le a_k < 1$ and at least one $a_k \neq 0$. Consider the distinct points

\[ (t, 0, k, a_k + j) \quad\text{and}\quad (t', 0, k-1, a_{k-1} + j), \quad\text{both in } \Lambda. \]

We must have that either $(t - t', 1) \in Z(V_{g_1} g_1)$ or $(0, a_k - a_{k-1}) \in Z(V_{g_1} g_1)$. However, since $|a_k - a_{k-1}| < 1$, the second case is impossible. This means that $(t - t', 1) \in Z(V_{g_1} g_1)$, which is possible only if $t = t'$. Therefore the fact that $(t, 0, k, a_k + j) \in \Lambda$ implies that $t = t_j$ for some real $t_j$. We now prove by induction on $|j|$ that $t_j = 0$ for all $j \in \mathbb{Z}$. The case $j = 0$ is clear as $(0,0,0,0) \in \Lambda$ by assumption. If our claim is true for all $|j| \le J$ where $J \ge 0$, choose $k \in \mathbb{Z}$ such that $a_{k+1} \neq 0$ and $a_k = 0$ if such $k$ exists. Suppose first that $j \ge 0$. There exists thus $t = t_{j+1} \in [0,1)$ such that

\[ (t_{j+1}, 0, k, j+1) \quad\text{and}\quad (0, 0, k+1, a_{k+1} + j) \quad\text{both belong to } \Lambda. \]

This implies that either $(t, -1) \in Z(V_{g_1} g_1)$ or $(0, a_{k+1} - 1) \in Z(V_{g_1} g_1)$. This last case is impossible and the first one is only possible if $t = 0$, showing that $t_{j+1} = 0$. Similarly, by considering the points

\[ (t_{j-1}, 0, k+1, a_{k+1} + j - 1) \quad\text{and}\quad (0, 0, k, j), \quad\text{which both belong to } \Lambda, \]

we can conclude that $t_{j-1} = 0$ for $j \le 0$. If $k$ as above does not exist, there exists a choice $k' \in \mathbb{Z}$ such that $a_{k'-1} \neq 0$ and $a_{k'} = 0$. By considering the points

\[ (t_{j+1}, 0, k', j+1) \quad\text{and}\quad (0, 0, k'-1, a_{k'-1} + j) \quad\text{if } j \ge 0 \]

and the points

\[ (t_{j-1}, 0, k'-1, a_{k'-1} + j - 1) \quad\text{and}\quad (0, 0, k', j) \quad\text{if } j \le 0, \]

which all belong to $\Lambda$, we conclude that $t_j = 0$ if $|j| = J + 1$. This proves (4.3).

If we are in the first case, i.e.

\[ \Gamma([0,1)^2) = \bigcup_{k\in\mathbb{Z}} (\mathbb{Z} + \mu_{k,0}) \times \{k\}, \]

let $m, m'$ be distinct integers. We have then

\[ T_{[0,1)^2}(m + \mu_{n,0}, n) = \{(t_m, 0)\} \quad\text{and}\quad T_{[0,1)^2}(m' + \mu_{n,0}, n) = \{(t_{m'}, 0)\}, \]

which implies that $V_{g_1} g_1(t_m - t_{m'}, m - m') = 0$ or $V_{g_1} g_1(0,0) = 0$. The second case is clearly impossible while the first one is possible only when $t_m = t_{m'}$. This shows (4.4), and (4.5) follows immediately from (4.3) and (4.4). □

Note that Lemma 4.5 implies that $\Gamma([0,1)^2) = \Gamma(\{(x,0) : 0 \le x < 1\})$ and $\Gamma((0,1)^2) = \emptyset$ if $(0,0,0,0) \in \Lambda$.

Lemma 4.6. Under the assumptions of Lemma 4.5, suppose that there exists $(\lambda_1,\lambda_2) \in \Gamma([0,1)^2)$ with $T_{[0,1)^2}(\lambda_1,\lambda_2) = \{(t,0)\}$ and $t \neq 0$. Then we can find numbers $t_k$ with $0 \le t_k < 1$ and $\mu_{k,m}$, $k, m \in \mathbb{Z}$, with $0 \le \mu_{k,m} < 1$, such that

\[ \Lambda \cap \left(\mathbb{R} \times [0,1) \times \mathbb{R}^2\right) = \{(m + t_k, 0, j + \mu_{k,m}, k) : j, k, m \in \mathbb{Z}\}. \]

Proof. By the result of Lemma 4.5, we have the identities (4.4) and (4.5). Let $T = \{t_k, k \in \mathbb{Z}\} \subset [0,1)$, where $t_k$, $k \in \mathbb{Z}$, are the numbers appearing in (4.4). Let $s_1, s_2 \in T$ with $s_1 < s_2$. Consider the half-open squares $C = (s_1, 0) + [0,1)^2$ and $C' = (s_1, 0) + ((0,1] \times [0,1))$. Then we know that $\Gamma(C) + [0,1]^2$ and $\Gamma(C') + [0,1]^2$ both tile $\mathbb{R}^2$. Let $P_0 = \{(s_1, y) : 0 \le y < 1\}$ and $P_1 = \{(s_1 + 1, y) : 0 \le y < 1\}$. Note that $\Gamma(P_0) = \Gamma(\{(s_1, 0)\})$. Moreover,

\[ \Gamma(C) = \Gamma(P_0) \cup \Gamma(C \setminus P_0), \quad \Gamma(C') = \Gamma(C' \setminus P_1) \cup \Gamma(P_1) \]

and since $C \setminus P_0 = C' \setminus P_1$, $\Gamma(P_0) = \Gamma(P_1)$. We have

\[ T_{C'}(\Gamma(P_1)) \subset \{(s_1 + 1, y),\ 0 \le y < 1\} \]

but since $(s_2, 0) \in C'$, we must have $T_{C'}(\Gamma(P_1)) = \{(s_1 + 1, 0)\}$ by Lemma 4.4. Since

\[ \Gamma(P_0) = \{(j + \mu_{k,0}, k) : j, k \in \mathbb{Z},\ t_k = s_1\} \]

and $\pi_2(\Gamma(P_0)) = \pi_2(\Gamma(P_1))$, where $\pi_2$ is the projection to the second coordinate, we have

\[ \Gamma(\{(1 + s_1, 0)\}) = \Gamma(P_1) = \{(j + \mu_{k,1}, k) : j, k \in \mathbb{Z},\ t_k = s_1\}, \]

for some constants $\mu_{k,1}$ with $0 \le \mu_{k,1} < 1$ using Proposition 3.2. Applying this argument to $s_1 = 0$ and $s_2 = t$, we obtain that

\[ \Lambda \cap \left(\{1\} \times [0,1) \times \mathbb{R}^2\right) = \{(j + \mu_{k,1}, k) : j, k \in \mathbb{Z},\ t_k = 0\}. \]

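Similar arguments applied to $s_1 = s$ and $s_2 = 1$ show that, for any $s \in T$, we have

\[ \Lambda \cap \left(\{s + 1\} \times [0,1) \times \mathbb{R}^2\right) = \{(j + \mu_{k,1}, k) : j, k \in \mathbb{Z},\ t_k = s\}, \]

and that $\Lambda \cap (\{s + 1\} \times [0,1) \times \mathbb{R}^2)$ is empty if $s \in [0,1) \setminus T$. The same idea can also be used to show the existence of constants $\mu_{k,-1}$ with $0 \le \mu_{k,-1} < 1$ such that

\[
\Lambda \cap \left(\{s - 1\} \times [0,1) \times \mathbb{R}^2\right) =
\begin{cases}
\{(j + \mu_{k,-1}, k) : j, k \in \mathbb{Z},\ t_k = s\}, & s \in T, \\
\emptyset, & s \in [0,1) \setminus T,
\end{cases}
\]

and, more generally using induction, that, for any $m \in \mathbb{Z}$, we can find constants $\mu_{k,m}$ with $0 \le \mu_{k,m} < 1$ such that

\[
\Lambda \cap \left(\{s + m\} \times [0,1) \times \mathbb{R}^2\right) =
\begin{cases}
\{(j + \mu_{k,m}, k) : j, k \in \mathbb{Z},\ t_k = s\}, & s \in T, \\
\emptyset, & s \in [0,1) \setminus T.
\end{cases}
\]

This proves our claim. □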


We can now complete the proof of the main result of this section, which gives a characterization for the subsets $\Lambda$ of $\mathbb{R}^4$ with the property that the associated set of time-frequency shifts applied to the window $\chi_{[0,1]^2}$ yields an orthonormal basis for $L^2(\mathbb{R}^2)$.

Proof of Theorem 1.4. It follows from Lemma 4.4 that the sets $T_{[0,1)^2}(\lambda_1,\lambda_2)$, $(\lambda_1,\lambda_2) \in \Gamma([0,1)^2)$, are either all of the form $\{(t,0)\}$ or all of the form $\{(0,t)\}$ with some $t \neq 0$. In the first case, we deduce from Lemma 4.6 that

\[ \Lambda \cap \left(\mathbb{R} \times [0,1) \times \mathbb{R}^2\right) = \{(m + t_k, 0, j + \mu_{k,m}, k) : j, k, m \in \mathbb{Z}\} \]

for certain numbers $t_k$ and $\mu_{k,m}$ in the interval $[0,1)$. We now show that $\Lambda$ will be of the first of the two possible forms given in the theorem. (Similarly, the second form follows from the second case of Lemma 4.6.)

Letting $C = [0,1)^2$ and $C' = [0,1) \times (0,1]$, we note that both $\Gamma(C) + [0,1]^2$ and $\Gamma(C') + [0,1]^2$ tile $\mathbb{R}^2$ but $\Gamma((0,1)^2)$ is empty. Hence, $\Gamma(C') = \Gamma(\{(x,1) : 0 \le x < 1\})$. It means that any set $T_{C'}(\lambda_1,\lambda_2)$ with $(\lambda_1,\lambda_2) \in \Gamma(C')$ is of the form $\{(t,1)\}$ for some $t = t(\lambda_1,\lambda_2)$ with $0 \le t < 1$. We now have two possible cases: either the cardinality of $T_{C'}(\Gamma(C'))$ is larger than one or equal to one. In the first case, we can find two distinct elements of $T_{C'}(\Gamma(C'))$ and we can then replicate the proof of Lemma 4.6 to obtain that

\[ \Lambda \cap \left(\mathbb{R} \times [1,2) \times \mathbb{R}^2\right) = \{(m + t_{k,1}, 1, j + \mu_{k,m,1}, k) : j, k, m \in \mathbb{Z}\}. \]

In the other case, $T_{C'}(\Gamma(C')) = \{(t_1, 1)\}$ for some $t_1$ with $0 \le t_1 < 1$. If we translate $C'$ horizontally and use the same argument as in the proof of Lemma 4.6, we see that

\[ \Lambda \cap \left(\mathbb{R} \times [1,2) \times \mathbb{R}^2\right) = \bigcup_{m\in\mathbb{Z}} \{(m + t_1, 1)\} \times \Lambda_{m,1}, \]

where $\Lambda_{m,1}$ is a spectrum for the unit square $[0,1]^2$. This last property is equivalent to $\Lambda_{m,1} + [0,1]^2$ being a tiling of $\mathbb{R}^2$ by the result in [12].

We can then prove the theorem inductively by translating the square $C'$ in the vertical direction using integer steps. □

Abstract: In this study, the $p$-singular values of random matrices with Gaussian entries, defined in terms of the $\ell_p$-norm for $p > 1$, are studied. Mainly, using analytical techniques, we show the probabilistic estimate, precisely, the decay, on the upper tail probability of the largest strictly convex singular values, when the number of rows of the matrices becomes very large, and on the lower tail probability as well. These results provide a probabilistic description of the behavior of the largest $p$-singular values of random matrices for $p > 1$. We also show some numerical experimental results, which verify the theoretical results.

Keywords:  Probability,  Random  Matrices,  Singular  Value,  Banach  Norm 


Introduction 


The largest singular value and the smallest singular value of random matrices in the $\ell_2$-norm, including Gaussian random matrices, Bernoulli random matrices, subgaussian random matrices, etc., have attracted major research interest in recent years and have applications in compressed sensing, a technique for recovering sparse or compressible signals. For instance, (Soshnikov, 2002; Soshnikov and Fyodorov, 2004) studied the largest singular value of random matrices, and (Rudelson and Vershynin, 2008a; 2008b; Tao and Vu, 2010), among others, studied the smallest singular values.

In the study of the asymptotic behavior of eigenvalues of symmetric random matrices, the Wigner symmetric matrix is a typical example, whose upper (or lower) diagonal entries are independent random variables with uniformly bounded moments. Wigner proved in (Wigner, 1958) that the normalized eigenvalues are asymptotically distributed according to the semicircular distribution. Precisely, let $A$ be a symmetric Gaussian random matrix of size $n \times n$ whose upper diagonal entries are independent and identically distributed copies of the standard Gaussian random variable; then the empirical distribution function of the eigenvalues of $\frac{1}{\sqrt{n}} A$ is asymptotically:

\[
\rho(x) :=
\begin{cases}
\frac{1}{2\pi} \sqrt{4 - x^2}\,dx & \text{for } |x| \le 2 \\
0 & \text{for } |x| > 2
\end{cases}
\tag{1.1}
\]

as the matrix size $n$ goes to infinity. This is the well-known Wigner's semicircle law, which provides the precise description of the statistical behavior of the eigenvalues of matrices of large size. In another case, for a random matrix $A$ whose entries are independent and identically distributed (i.i.d.) copies of a complex random variable with mean 0 and variance 1, Tao and Vu (2008; Tao et al., 2010) showed that the eigenvalues of $\frac{1}{\sqrt{n}} A$ converge to the uniform distribution on the unit disk as $n$ goes to $\infty$, and that this holds not only for random matrices with real entries but also for complex entries. Their result also generalized (Girko, 1985) and solved the circular law conjecture, open since the 1950's, that the empirical spectral distribution of the eigenvalues converges to the uniform distribution over the unit disk as $n$ tends to infinity (Bai, 1997).
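The semicircle law (1.1) is easy to observe numerically. The following is a minimal sketch, assuming NumPy is available; the function names `wigner_eigs` and `semicircle_density`, the matrix size and the seed are illustrative choices, not part of the paper. It samples one symmetric Gaussian matrix, rescales by $1/\sqrt{n}$, and compares simple statistics of the eigenvalues with those of the semicircle density (total mass 1, second moment 1, support $[-2,2]$).

```python
import numpy as np


def wigner_eigs(n, seed=0):
    """Eigenvalues of A / sqrt(n) for a symmetric matrix A with i.i.d. N(0, 1)
    entries on and above the diagonal."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, n))
    a = np.triu(g) + np.triu(g, 1).T  # mirror the strict upper triangle
    return np.linalg.eigvalsh(a / np.sqrt(n))


def semicircle_density(x):
    """Density in (1.1): sqrt(4 - x^2) / (2 pi) on [-2, 2], zero elsewhere."""
    x = np.asarray(x, dtype=float)
    inside = np.sqrt(np.clip(4.0 - x**2, 0.0, None)) / (2.0 * np.pi)
    return np.where(np.abs(x) <= 2.0, inside, 0.0)


eigs = wigner_eigs(400)
xs = np.linspace(-2.0, 2.0, 4001)
mass = semicircle_density(xs).sum() * (xs[1] - xs[0])  # Riemann sum of the density
print("density mass     :", mass)
print("second moment    :", np.mean(eigs**2))
print("largest |eigval| :", np.abs(eigs).max())
```

A histogram of `eigs` plotted against `semicircle_density` gives the familiar picture; the summary statistics above are used instead so that the sketch has no plotting dependency.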

The largest singular values of matrices are actually their $p$-norm, which, from a geometric perspective, has connections with the Minkowski space and the complex $\ell_p$ space in differential geometry, for which one can refer to (Liu, 2013; 2011), because one can view the $p$-norm of a matrix as a generalization of the $p$-norm of a vector.

For random matrices whose entries are i.i.d. random variables satisfying certain moment conditions, the largest singular value was studied in (Geman, 1980; Yin et al., 1988). Tracy and Widom (1996) showed that the limiting law of the largest eigenvalue distributions of the Gaussian Orthogonal Ensemble (GOE) is given in terms of a particular Painlevé II function, which is the well-known Tracy-Widom law. Furthermore, the distribution of the

Science 

Publications 


2015  Yang  Liu.  This  open  access  article  is  distributed  under  a  Creative  Commons  Attribution  (CC-BY)  3.0  license. 



DISTRIBUTION  A:  Distribution  approved  for  public  release. 


Yang  Liu  /  Journal  of  Mathematics  and  Statistics  2015,  1 1  (1):  7.15 

DOI:  10.3844/jmssp.2015.7.15 


eigenvalue of Wishart matrices, $W_{N,n} = AA^*$, where $A = A_{N,n}$ is a Gaussian random matrix of size $N \times n$, was studied in (Johansson, 2000; Johnstone, 2001). They showed that the distribution of the largest eigenvalue of Wishart matrices converges to the Tracy-Widom law as $\frac{n}{N}$ tends to some positive constant. More generally, non-Gaussian random matrices were studied in (Soshnikov, 2002). Seginer (2000) compared the Euclidean operator norm of a random matrix with i.i.d. mean zero entries to the Euclidean norm of its rows and columns. Later, (Latala, 2005) gave an upper bound on the expectation (or average value) of the largest singular value, namely the norm, of any random matrix whose entries are independent mean zero random variables with uniformly bounded fourth moment.

The condition number, which is the ratio of the largest singular value over the smallest singular value of a matrix, is critical to the stability of linear systems. In (Edelman, 1988), the distribution of the condition number of Gaussian random matrices was investigated, particularly in numerical experiments. As a typical example of subgaussian random matrices, the invertibility of Bernoulli random matrices was also studied. In (Tao and Vu, 2007), the probability of Bernoulli random matrices to be singular is shown to be at most $\left(\frac{3}{4} + o(1)\right)^n$, where $n$ is the size of the matrices. Their result shows that the probability of the smallest singular value of Bernoulli random matrices to be zero is exponentially small as $n$ tends to infinity. Recently, the singularity probability bound $\left(\frac{3}{4} + o(1)\right)^n$ has been improved to $\left(\frac{1}{\sqrt{2}} + o(1)\right)^n$ by (Bourgain et al., 2010).
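For very small $n$ the singularity probability of a Bernoulli ($\pm 1$) matrix can be computed exactly by enumeration, which gives a feel for the quantity being bounded above. The sketch below is illustrative only: the function name `singular_fraction` is hypothetical, and exhaustive enumeration is practical only for $n \le 4$ or so, since there are $2^{n^2}$ matrices.

```python
import itertools
import numpy as np


def singular_fraction(n):
    """Exact fraction of n x n matrices with entries in {-1, +1} that are singular."""
    singular = 0
    total = 0
    for entries in itertools.product((-1.0, 1.0), repeat=n * n):
        a = np.array(entries).reshape(n, n)
        total += 1
        # The determinant of a +-1 matrix is an integer, so |det| < 0.5 means det = 0.
        if abs(np.linalg.det(a)) < 0.5:
            singular += 1
    return singular / total


for n in (1, 2, 3):
    print(n, singular_fraction(n))
```

For $n = 2$ the determinant is $ad - bc$ with $ad, bc \in \{-1, 1\}$ independent and uniform, so exactly half of the 16 matrices are singular.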

The  recent  studies  of  the  smallest  singular  value  have 
also  been  motivated,  in  a  large  sense,  by  some  open 
questions  or  conjectures.  Spielman  and  Teng  (2002)  the 
following  conjecture  was  proposed  in  the  International 
Congress  of  Mathematicians  in  2002. 


Conjecture 1.1

Let $\xi$ be a Bernoulli random variable, in other words, $P(\xi = 1) = P(\xi = -1) = \frac{1}{2}$, and let $A$ be an $n \times n$ matrix whose entries are i.i.d. copies of $\xi$. Then:

$$P\left(s_n(A) \le \frac{t}{\sqrt{n}}\right) \le t + c^n \qquad (1.2)$$

for all $t > 0$ and some $0 < c < 1$.

In the breakthrough work on the estimate of the smallest singular value (Rudelson and Vershynin, 2008a), Rudelson and Vershynin obtained the upper tail probabilistic estimate on the smallest singular value in the $\ell_2$-norm for square matrices of centered random variables with unit variance and appropriate moment assumptions. In particular, they proved the Spielman-Teng conjecture up to a constant. The lower tail probabilistic estimate on the smallest singular value in the $\ell_2$-norm for square matrices was obtained in (Rudelson and Vershynin, 2008b). These results have shown that the smallest singular value of $n \times n$ subgaussian random matrices is of order $n^{-1/2}$ with high probability for large $n$. More explicitly, the distribution of the smallest singular value of random matrices was given in (Tao and Vu, 2010) by using property testing from combinatorics and theoretical computer science. Pregaussian matrices were used to recover sparse images in (Rauhut, 2010) and in matrix recovery, for which one can refer to (Oymak et al., 2011; Lai et al., 2012). Very recently, Rudelson and Vershynin (2010) gave a comprehensive survey on the extreme singular values of random matrices.

It is well known that the classic singular value is defined in terms of the $\ell_2$-norm, so a natural question is what happens if one defines the singular value by the $\ell_q$-quasinorm for $0 < q \le 1$ or by the $\ell_p$-norm for $p > 1$. There were some remarkable results by other researchers on the largest singular values of random matrices in the $\ell_2$-norm. Geman (1980) and Yin et al. (1988) showed that the largest singular value of random matrices of size $m \times N$ with independent entries of mean 0 and variance 1 tends to $\sqrt{m} + \sqrt{N}$ almost surely. The largest and smallest $q$-singular values of pregaussian random matrices for $0 < q \le 1$ were studied in (Lai and Liu, 2014), which has applications in signal processing (Foucart and Lai, 2010; 2009; Lai and Liu, 2011) and other areas. Similar to the $q$-singular value when $0 < q \le 1$, the largest $p$-singular value, for which $p > 1$ and the norm is strictly convex, can be defined. We will show probabilistic estimates, more precisely the decay, of the upper tail probability of the largest $p$-singular value when the number of rows of the matrices becomes very large, as well as of the lower tail probability. These results provide a probabilistic description of the behavior of the largest $p$-singular values of random matrices.

The Largest p-Singular Value

The $p$-singular values of a matrix can, in general, be defined as a maximum of minimums or a supremum of infimums. In particular, the largest $p$-singular value can be defined as follows:

Definition 2.1

For an $m \times N$ matrix $A$, the largest $p$-singular value of $A$, denoted as $s_1^{(p)}(A)$, is defined as:

$$s_1^{(p)}(A) := \sup\left\{ \|Ax\|_p : x \in \mathbb{R}^N \text{ with } \|x\|_p = 1 \right\} \qquad (2.1)$$

for given $p > 1$.
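To make Definition 2.1 concrete, the following minimal sketch evaluates the supremum in (2.1) in the one case used here where it has a closed form (p = 1, the maximum absolute column sum) and otherwise produces only a lower bound by trying the standard basis vectors and random unit vectors. The function names, the list-of-rows matrix layout and the sampling strategy are illustrative assumptions and are not taken from the paper.

```python
import random


def p_norm(v, p):
    """l_p norm of a vector for finite p >= 1."""
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def mat_vec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def largest_1_singular_value(A):
    """Exact value of (2.1) for p = 1: the maximum absolute column sum."""
    n_cols = len(A[0])
    return max(sum(abs(row[j]) for row in A) for j in range(n_cols))


def sampled_lower_bound(A, p, trials=2000, seed=0):
    """Lower bound on sup ||Ax||_p over ||x||_p = 1.

    Candidates are the standard basis vectors plus random directions
    normalized in the l_p norm (a hypothetical sampling scheme).
    """
    rng = random.Random(seed)
    n_cols = len(A[0])
    candidates = [[1.0 if j == k else 0.0 for j in range(n_cols)]
                  for k in range(n_cols)]
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        norm = p_norm(x, p)
        if norm > 0.0:
            candidates.append([t / norm for t in x])
    return max(p_norm(mat_vec(A, x), p) for x in candidates)
```

For p = 1 the sampled bound coincides with the exact value because the supremum is attained at a basis vector; for other p it is only a lower bound.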

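In Lai and Liu (2014), the following lemma on a linear bound for partial binomial expansion was established.

Lemma 2.2

For every positive integer $n$:

$$\sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} x^k (1-x)^{n-k} \le 8x \qquad (2.2)$$

for all $x \in [0, 1]$.

The above lemma can be applied to estimate probabilities.

Lemma 2.3

Suppose $\xi_1, \xi_2, \ldots, \xi_n$ are i.i.d. copies of a random variable $\xi$, then for any $\varepsilon > 0$:

$$P\left(\sum_{i=1}^{n} |\xi_i|^p \le \frac{n\varepsilon}{2}\right) \le 8 P\left(|\xi|^p \le \varepsilon\right) \qquad (2.3)$$

for any given $p > 1$.

Proof. Given $p > 1$, the event:

$$\left\{ \sum_{i=1}^{n} |\xi_i|^p \le \frac{n\varepsilon}{2} \right\} \qquad (2.4)$$

is contained in:

$$\mathcal{E} := \bigcup_{k=\lfloor n/2 \rfloor + 1}^{n} \left\{ (\xi_1, \ldots, \xi_n) : |\xi_{i_1}|^p \le \varepsilon, \ldots, |\xi_{i_k}|^p \le \varepsilon, \ |\xi_{i_{k+1}}|^p > \varepsilon, \ldots, |\xi_{i_n}|^p > \varepsilon \right\} \qquad (2.5)$$

where $\{i_1, i_2, \ldots, i_k\}$ is a subset of $\{1, 2, \ldots, n\}$ and $\{i_{k+1}, \ldots, i_n\}$ is its complement.

Let $x = P(|\xi_1|^p \le \varepsilon)$, then by the union probability:

$$P(\mathcal{E}) = \sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} x^k (1-x)^{n-k} \qquad (2.6)$$

and applying Lemma 2.2, we have:

$$P(\mathcal{E}) \le 8x = 8 P(|\xi_1|^p \le \varepsilon) \qquad (2.7)$$

Since the event (2.4) is contained in the event (2.5):

$$P\left(\sum_{i=1}^{n} |\xi_i|^p \le \frac{n\varepsilon}{2}\right) \le P(\mathcal{E}) \le 8 P(|\xi_1|^p \le \varepsilon) \qquad (2.8)$$

To estimate the lower tail probability of the largest $p$-singular value, we have the following theorem for $p > 1$.

Theorem 2.4

Let $\xi$ be a pregaussian variable normalized to have variance 1 and $A$ be an $m \times N$ matrix with i.i.d. copies of $\xi$ in its entries, then for every $p > 1$ and any $\varepsilon > 0$, there exists $\gamma > 0$ such that:

$$P\left(s_1^{(p)}(A) \le \gamma m^{1/p}\right) \le \varepsilon \qquad (2.9)$$

where $\gamma$ only depends on $p$, $\varepsilon$ and the pregaussian variable $\xi$.

Proof. Since $a_{ij}$ is pregaussian with variance 1, for any $\varepsilon > 0$ there is some $\delta > 0$ such that:

$$P\left(|a_{ij}|^p \le \delta\right) \le \frac{\varepsilon}{8} \qquad (2.10)$$

But we know:

$$\sum_{i=1}^{m} |a_{ij}|^p \le \left(s_1^{(p)}(A)\right)^p \qquad (2.11)$$

for all $j$, because by Definition 2.1, choosing $x$ to be the standard basis vectors of $\mathbb{R}^N$ gives $\max_j \left(\sum_{i} |a_{ij}|^p\right)^{1/p} \le s_1^{(p)}(A)$.

Therefore, by Lemma 2.3:

$$P\left(s_1^{(p)}(A) \le \left(\frac{\delta}{2}\right)^{1/p} m^{1/p}\right) \le P\left(\sum_{i=1}^{m} |a_{ij}|^p \le \frac{m\delta}{2}\right) \le 8 P\left(|a_{ij}|^p \le \delta\right) \le \varepsilon \qquad (2.12)$$

Thus let $\gamma = (\delta/2)^{1/p}$, then (2.9) follows.

For the upper tail probability of the largest $p$-singular value, $p > 1$, we first derive the following lemma by using the Minkowski inequality and the discrete Hölder inequality.

Lemma 2.5

For $p \ge 1$, (2.1) defines a norm on the space of $m \times N$ matrices and:

$$\max_j \|a_j\|_p \le s_1^{(p)}(A) \le N^{\frac{p-1}{p}} \max_j \|a_j\|_p \qquad (2.13)$$

in which $a_j$, $j = 1, 2, \ldots, N$, are the column vectors of $A$.

Applying the above lemma, an estimate we can derive easily for Bernoulli random matrices, whose entries each equal 1 or -1 with equal probability (Tao and Vu, 2009), is the following theorem on the largest $p$-singular value of Bernoulli matrices for $p > 1$.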

Theorem 2.6

Let $\xi$ be a Bernoulli random variable normalized to have variance 1 and $A$ be an $m \times N$ matrix with i.i.d. copies of $\xi$ in its entries, then:

$$m^{1/p} \le s_1^{(p)}(A) \le m^{1/p} N^{\frac{p-1}{p}} \qquad (2.14)$$

One may conjecture that the bound might be $m^{1/p}$. However, considering Bernoulli matrices, whose entries follow the Bernoulli distribution, as special subgaussian matrices, the expectation of the largest $p$-singular value may not be $m^{1/p}$. Indeed, let $A$ be an $m \times m$ Bernoulli matrix and $x$ be a non-zero vector in $\mathbb{R}^m$. The expectation of the largest $p$-singular value satisfies:

$$E\left(s_1^{(p)}(A)\right) \ge E\,\frac{\|Ax\|_p}{\|x\|_p} \qquad (2.15)$$

for all $x \in \mathbb{R}^m$, and particularly for $x = (1, \ldots, 1) \in \mathbb{R}^m$ we have:

$$E\,\frac{\|Ax\|_p}{\|x\|_p} = m^{-1/p}\, E\left(\sum_{i=1}^{m} |\varepsilon_{i1} + \cdots + \varepsilon_{im}|^p\right)^{1/p} \qquad (2.16)$$

Now let $X_i := \varepsilon_{i1} + \cdots + \varepsilon_{im}$, then $E\left(|X_1|^p + \cdots + |X_m|^p\right)^{1/p}$ is the expectation of the $\ell_p$-norm of the vector $(X_1, X_2, \ldots, X_m)$.

We also have the following result on the upper tail probability of the largest $p$-singular value of Bernoulli matrices for $p > 1$.

Theorem 2.7

Let $A$ be an $m \times m$ Bernoulli matrix with every entry equal to 1 or -1 with equal probability, then one has:

$$P\left(s_1^{(p)}(A) \ge Km\right) \le \exp(-cm) \qquad (2.17)$$

for some $K > 0$ and some absolute constant $c > 0$.

Proof. Let $A = (\varepsilon_{ij})_{m \times m}$, and let $S_p^{m-1}$ be the unit sphere with respect to the $\ell_p$-norm in $\mathbb{R}^m$, then for any $x \in S_p^{m-1}$, by the convexity of the function $f(t) := t^p$ for $p \ge 1$:

$$\|Ax\|_p^p = \sum_{i=1}^{m} \Big| \sum_{j=1}^{m} \varepsilon_{ij} x_j \Big|^p \le m^{p-1} \sum_{i=1}^{m} \sum_{j=1}^{m} |\varepsilon_{ij} x_j|^p = m^p \qquad (2.18)$$

Therefore we have:

$$E\|Ax\|_p \le m \qquad (2.19)$$

for all $x \in S_p^{m-1}$. By the Chernoff bound, we get:

$$P\left(\|Ax\|_p \ge Km\right) \le P\left(\|Ax\|_p \ge K\, E\|Ax\|_p\right) \le \exp(-cKm) \qquad (2.20)$$

for any $K \ge 2$ and some absolute constant $c > 0$.

By Lemma 4.10 in (Pisier, 1999), there is a subset $\mathcal{N}$ which is a $\delta$-net of $S_p^{m-1}$ with cardinality:

$$\mathrm{card}(\mathcal{N}) \le \left(1 + \frac{2}{\delta}\right)^m \qquad (2.21)$$

Finally, using the union bound of probability and an approximation of any point on the sphere by points of the $\delta$-net, we obtain (2.17).

For rectangular matrices, we have the following theorem on the upper tail probability of the largest $p$-singular value for $1 < p < 2$.

Theorem 2.8

Let $\xi$ be a pregaussian variable normalized to have variance 1 and $A$ be an $m \times N$ matrix with i.i.d. copies of $\xi$ in its entries, then for every $1 < p < 2$ and any $\varepsilon > 0$, there exists $K > 0$ such that:

$$P\left(s_1^{(p)}(A) \ge K\left(m^{\frac{1}{p}} + m^{\frac{2-p}{2p}} N^{\frac{1}{2}}\right)\right) \le \varepsilon \qquad (2.22)$$

where $K$ only depends on $p$, $\varepsilon$ and the pregaussian variable $\xi$.

Proof. By the discrete Hölder inequality and the definition of the largest $p$-singular value:

$$s_1^{(p)}(A) = \sup_{x \in \mathbb{R}^N,\, x \ne 0} \frac{\|Ax\|_p}{\|x\|_p} \le \sup_{x \in \mathbb{R}^N,\, x \ne 0} \frac{m^{\frac{1}{p} - \frac{1}{2}} \|Ax\|_2}{\|x\|_2} = m^{\frac{1}{p} - \frac{1}{2}}\, s_1^{(2)}(A) \qquad (2.23)$$

We also know that there exists $K > 0$ such that:

$$P\left(s_1^{(2)}(A) \ge K\left(m^{\frac{1}{2}} + N^{\frac{1}{2}}\right)\right) \le \varepsilon \qquad (2.24)$$

Therefore, we have:

$$P\left(s_1^{(p)}(A) \ge K\left(m^{\frac{1}{p}} + m^{\frac{1}{p} - \frac{1}{2}} N^{\frac{1}{2}}\right)\right) \le P\left(s_1^{(2)}(A) \ge K\left(m^{\frac{1}{2}} + N^{\frac{1}{2}}\right)\right) \le \varepsilon \qquad (2.25)$$


To have a full generalization, let us derive the following useful lemma.

In general, for the relation between $s_1^{(p)}$ and $s_1^{(q)}$ with $\frac{1}{p} + \frac{1}{q} = 1$, $q \ge 1$, one can deduce the following duality lemma on general rectangular matrices.

Lemma 2.9

For any $q \ge 1$ and $m \times N$ matrix $A$:

$$s_1^{(q)}(A) = s_1^{(p)}(A^T) \qquad (2.26)$$

where $\frac{1}{p} + \frac{1}{q} = 1$.

Proof. By the discrete Hölder inequality, we know that if $\frac{1}{p} + \frac{1}{q} = 1$ then:

$$\langle Ax, y \rangle \le \|Ax\|_q \qquad (2.27)$$

for all $x \in \mathbb{R}^N$ and $y \in \mathbb{R}^m$ with $\|y\|_p = 1$, and furthermore the equality holds for some $y_0$ with $\|y_0\|_p = 1$. Thus:

$$\|Ax\|_q = \sup_{y \in \mathbb{R}^m,\, \|y\|_p = 1} \langle Ax, y \rangle \qquad (2.28)$$

By the definition of the largest $q$-singular value:

$$s_1^{(q)}(A) = \sup_{x \in \mathbb{R}^N,\, \|x\|_q = 1} \|Ax\|_q = \sup_{x \in \mathbb{R}^N,\, \|x\|_q = 1} \ \sup_{y \in \mathbb{R}^m,\, \|y\|_p = 1} \langle Ax, y \rangle \qquad (2.29)$$

In the same way, we also have:

$$s_1^{(p)}(A^T) = \sup_{y \in \mathbb{R}^m,\, \|y\|_p = 1} \ \sup_{x \in \mathbb{R}^N,\, \|x\|_q = 1} \langle Ax, y \rangle \qquad (2.30)$$

Finally, using $\langle Ax, y \rangle = \langle A^T y, x \rangle$ and switching the supremums, we get $s_1^{(q)}(A) = s_1^{(p)}(A^T)$.

We have the following remarks on the above lemma.

Remark 2.10

One can also obtain the above lemma by the operator duality on the dual spaces.

Remark 2.11

The above lemma allows us to obtain the probabilistic estimates on $s_1^{(p)}(A)$ for $p > 2$ by taking the transpose of $A$ and using the estimates on $s_1^{(q)}(A^T)$.

Thus, using the duality lemma, we obtain the following.

Theorem 2.12

(Lower tail probability of the largest $p$-singular value of rectangular matrices, $p > 2$). Let $\xi$ be a pregaussian random variable normalized to have variance 1 and $A$ be an $m \times N$ matrix with i.i.d. copies of $\xi$ in its entries, then for every $p > 2$ and any $\varepsilon > 0$, there exists $\gamma > 0$ such that:

$$P\left(s_1^{(p)}(A) \le \gamma m^{\frac{p-1}{p}}\right) \le \varepsilon \qquad (2.31)$$

where $\gamma$ only depends on $p$, $\varepsilon$ and the pregaussian random variable $\xi$.

Moreover, we have the upper tail probability of the largest $p$-singular value of rectangular matrices for $p > 2$.

Theorem 2.13

(Upper tail probability of the largest $p$-singular value of rectangular matrices, $p > 2$). Let $\xi$ be a pregaussian variable normalized to have variance 1 and $A$ be an $m \times N$ matrix with i.i.d. copies of $\xi$ in its entries, then for every $p > 2$ and any $\varepsilon > 0$, there exists $K > 0$ such that:

$$P\left(s_1^{(p)}(A) \ge K\left(N^{\frac{p-1}{p}} + m^{\frac{1}{2}} N^{\frac{p-2}{2p}}\right)\right) \le \varepsilon$$

where $K$ only depends on $p$, $\varepsilon$ and the pregaussian variable $\xi$.

Numerical  Experiments 

In general, matrix $p$-norms are NP-hard to approximate if $p \ne 1, 2, \infty$, for which one can refer to (Hendrickx and Olshevsky, 2010; Liu, 2014; Higham, 1992). In this section, however, we show the results of some numerically computable experiments on the $p$-singular value for $p > 1$ and the $q$-singular value for $0 < q \le 1$ of random matrices.

For $p = 2$, we plot the largest 2-singular value of Gaussian random matrices of size $n \times n$, where $n$ runs from 1 through 100. The graph in Figure 1 shows that the largest 2-singular value is $O(\sqrt{n})$.

For $p = 1$, in the first numerical experiment we plot the largest 1-singular value of Gaussian random matrices of size $n \times n$, where $n$ runs from 1 through 100. The graph in Figure 2 shows that the largest 1-singular value is $O(n)$.

In the second numerical experiment for $p = 1$, we plot the largest 1-singular value of Gaussian random matrices of size $n \times n$, where $n$ runs from 1 through 200. The graph in Figure 3 shows that the largest 1-singular value is $O(n)$.

In the third experiment for $p = 1$, we plot the largest 1-singular value of Gaussian random matrices of size $n \times n$, where $n$ runs from 1 through 400. The graph in Figure 4 shows that the largest 1-singular value is $O(n)$.

For $p = \infty$, we plot the largest $\infty$-singular value of Gaussian random matrices of size $n \times n$, where $n$ runs from 1 through 500. The graph in Figure 5 shows that the largest $\infty$-singular value is $O(n)$.
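The experiments for p = 1, 2 and ∞ can be imitated with a short script. The sketch below is not the authors' code: it assumes standard Gaussian entries, uses the closed forms for p = 1 (maximum absolute column sum) and p = ∞ (maximum absolute row sum), and approximates the 2-singular value from below by a plain power iteration. The helper names and the normalizations by n and √n are illustrative choices.

```python
import math
import random


def gaussian_matrix(n, rng):
    """n x n matrix of independent standard Gaussian entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


def largest_1_singular_value(A):
    """Maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))


def largest_inf_singular_value(A):
    """Maximum absolute row sum."""
    return max(sum(abs(t) for t in row) for row in A)


def largest_2_singular_value(A, iters=200):
    """Power iteration on A^T A; approaches the spectral norm from below."""
    n_cols = len(A[0])
    x = [1.0 / math.sqrt(n_cols)] * n_cols
    estimate = 0.0
    for _ in range(iters):
        y = [sum(a * t for a, t in zip(row, x)) for row in A]
        estimate = math.sqrt(sum(t * t for t in y))
        z = [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(n_cols)]
        norm_z = math.sqrt(sum(t * t for t in z))
        if norm_z == 0.0:
            break
        x = [t / norm_z for t in z]
    return estimate


def growth_table(sizes, seed=0):
    """Rows (n, s2 / sqrt(n), s1 / n, s_inf / n) for one random matrix per size."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        A = gaussian_matrix(n, rng)
        rows.append((n,
                     largest_2_singular_value(A) / math.sqrt(n),
                     largest_1_singular_value(A) / n,
                     largest_inf_singular_value(A) / n))
    return rows
```

Calling `growth_table([10, 20, 40, 80])` and inspecting whether the scaled columns stay bounded is one way to compare against the growth rates read off the figures.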

In Higham (1992), the $p$-norm of a matrix of size $m$ by $n$ was estimated reliably in $O(nm)$ operations, and an algorithm that can estimate the $p$-norm to a specific accuracy, within a factor of $n^{1 - \frac{1}{p}}$, was provided. Using this algorithm, we plot the largest 4-singular value of Gaussian random matrices and Bernoulli random matrices of size $m \times n$, where $m$ and $n$ run from 1 through 81 (Fig. 6 and 7).
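For readers who want to experiment with values of p other than 1, 2 and ∞, the following is a minimal sketch of a nonlinear power iteration for the matrix p-norm. It is not the algorithm of Higham (1992) itself. It alternates Hölder-dual vectors, starts from the column of largest p-norm, and returns the best value of ||Ax||_p seen, which is always a lower bound on the largest p-singular value. All names are hypothetical.

```python
import math


def p_norm(v, p):
    """l_p norm of a vector for finite p >= 1."""
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def mat_vec(A, x):
    """A x for a matrix stored as a list of rows."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def mat_t_vec(A, y):
    """A^T y for a matrix stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def dual_vector(v, p):
    """Vector d with <d, v> = ||v||_p and unit norm in the conjugate exponent."""
    norm = p_norm(v, p)
    if norm == 0.0:
        return [0.0] * len(v)
    return [math.copysign(abs(t) ** (p - 1.0), t) / norm ** (p - 1.0) for t in v]


def p_norm_power_estimate(A, p, iters=50):
    """Lower estimate of sup ||Ax||_p / ||x||_p for 1 < p < infinity."""
    if p <= 1.0:
        raise ValueError("this sketch assumes p > 1")
    q = p / (p - 1.0)  # conjugate exponent, 1/p + 1/q = 1
    n_cols = len(A[0])
    columns = [[row[j] for row in A] for j in range(n_cols)]
    j_best = max(range(n_cols), key=lambda j: p_norm(columns[j], p))
    x = [1.0 if j == j_best else 0.0 for j in range(n_cols)]
    best = p_norm(columns[j_best], p)
    for _ in range(iters):
        y = mat_vec(A, x)
        z = mat_t_vec(A, dual_vector(y, p))
        if p_norm(z, q) == 0.0:
            break
        x = dual_vector(z, q)  # satisfies ||x||_p = 1
        best = max(best, p_norm(mat_vec(A, x), p))
    return best
```

Because every iterate has unit p-norm, the returned value can be sandwiched between the two sides of the column-norm bound (2.13), which gives a quick sanity check on any run.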


Fig. 1. Largest 2-singular value of Gaussian random matrices (horizontal axis: size of Gaussian random matrices)

Fig. 2. Largest 1-singular value of Gaussian random matrices: Experiment 1

Fig. 3. Largest 1-singular value of Gaussian random matrices: Experiment 2 (horizontal axis: size of Gaussian random matrices)

Fig. 4. Largest 1-singular value of Gaussian random matrices: Experiment 3

Fig. 5. Largest ∞-singular value of Gaussian random matrices

Fig. 6. Largest 4-singular value of Gaussian random matrices (axes: number of rows and number of columns of rectangular matrices)

Fig. 7. Largest 4-singular value of Bernoulli random matrices (axes: number of rows and number of columns of rectangular matrices)

THE PROBABILISTIC ESTIMATES ON THE LARGEST AND SMALLEST q-SINGULAR VALUES OF RANDOM MATRICES

MING-JUN  LAI  AND  YANG  LIU 

ABSTRACT. We study the q-singular values of random matrices with pre-Gaussian entries defined in terms of the $\ell_q$-quasinorm with $0 < q \le 1$. In this paper, we mainly consider the decay of the lower and upper tail probabilities of the largest q-singular value, when the number of rows of the matrices becomes very large. Based on the results in probabilistic estimates on the largest q-singular value, we also give probabilistic estimates on the smallest q-singular value for pre-Gaussian random matrices.


1.  Introduction 

The extremal spectrum, that is, the largest and smallest singular values of random matrices, has been of interest to many research communities including numerical analysis and multivariate statistics. For example, the condition numbers of random matrices were of interest as early as in von Neumann and Goldstine’1947, [28] and Smale’1985, [19], and the distribution of the largest and smallest eigenvalues of Wishart matrices was studied in Wishart’1928, [30]. Some estimates for the probability distribution of the norm of a random matrix transformation were obtained in Bennett, Goodman and Newman’1975, [2]. In 1988, Edelman presented a comprehensive study on the distribution of the condition numbers of Gaussian random matrices together with many numerical experiments (cf. [5]). In particular, Edelman explained several interesting applications of eigenvalues of random matrices in graph theory, the zeros of the Riemann zeta function, as well as in nuclear physics (cf. [6]). Indeed, the well-known semi-circle law (cf. Wigner’1962, [29]) states that the histogram for the eigenvalues of a large random matrix is roughly a semi-circle. To be more precise, let $A$ be a Gaussian random matrix and $M(x)$ denote the proportion of eigenvalues of the Gaussian orthogonal ensemble $(A + A^T)/(2\sqrt{n})$ (the symmetric part of $A/\sqrt{n}$) that are less than $x$. Then the semi-circle law asserts that


This interesting property has made a long-lasting impact and attracted many researchers to extend and generalize the semi-circle law. See the recent papers of Tao and Vu’2008, [24] and Rudelson and Vershynin’2010, [17] for new results and surveys and the references therein. It is known that the largest eigenvalue of $M_s = \frac{1}{s} V_{n \times s} (V_{n \times s})^T$ converges to $(1 + \sqrt{y})^2$ almost surely (cf. Geman’1980, [10]) and the smallest eigenvalue converges to $(1 - \sqrt{y})^2$ almost surely (cf. Silverstein’1985, [18]), where $V_{n \times s}$ is a Gaussian random matrix of size $n \times s$ with $n/s \to y \in (0, 1]$ and $V_{n \times s} (V_{n \times s})^T$ is called a Wishart matrix. The behavior of the largest singular value of random matrices $A$ with i.i.d. entries is well studied. If a random variable $\xi$ has a bounded fourth moment, then the largest eigenvalue $s_1(A)$ of an $n \times n$ random matrix $A$ with i.i.d. copies of $\xi$ satisfies the following property:

$$\lim_{n \to \infty} \frac{s_1(A)}{\sqrt{n}} = 2\sqrt{E\xi^2}$$

almost surely. See, e.g., Yin, Bai, Krishnaiah’1988, [31] and Bai, Silverstein and Yin’1988, [1]. The bounded fourth moment is necessary and sufficient in this case. However, the behavior of the smallest singular value for general random matrices has been much less known. Although Edelman showed that for every $\varepsilon > 0$, the smallest eigenvalue $s_n(A)$ of a Gaussian random matrix $A$ of size $n \times n$ satisfies

$$P\left(s_n(A) \le \frac{\varepsilon}{\sqrt{n}}\right) \le \varepsilon,$$

the probability estimates for $s_n(A)$ for a general random matrix $A$ were not known until the results in Rudelson and Vershynin’2008, [14]. In fact, Rudelson in [16] presented a less accurate probability estimate for $s_n(A)$, and soon both Rudelson and Vershynin found a simpler proof of a much more accurate estimate in [15]. More precisely, Rudelson and Vershynin first showed (cf. [15]) the following results:

Theorem 1.1. If $A$ is a matrix of size $n \times n$ whose entries are independent random variables with variance 1 and bounded fourth moment, then

$$\lim_{\varepsilon \to 0^+} \limsup_{n \to \infty} P\left(s_n(A) \le \frac{\varepsilon}{\sqrt{n}}\right) = 0.$$

Furthermore,  in  Rudelson  and  Vershynin’2008,  [14],  they  presented  a  proof  of 
the  following 

Theorem 1.2. Let $A$ be an $n \times n$ matrix whose entries are i.i.d. centered random variables with unit variance and fourth moment bounded by $B$. Then

$$\lim_{K \to +\infty} \limsup_{n \to \infty} P\left(s_n(A) \ge \frac{K}{\sqrt{n}}\right) = 0.$$

These two results settled a conjecture by Smale in [18] (the results on the Gaussian case were established by Edelman and Szarek; see [6] and [22]). More precise estimates for the largest and smallest eigenvalues have been given for sub-Gaussian random matrices, Bernoulli matrices, covariance matrices, and general random matrices of the form $M + A$ with deterministic matrix $M$ and random matrix $A$ in the last ten years. See, e.g. [25], [20], [14], [26], [23] and the references in [17].

In this paper, we extend these studies on the probability estimate of the largest and smallest singular values of random matrices in the $\ell_2$-norm and give estimates for these extremal spectra in the setting of the $\ell_q$-quasinorm for $0 < q \le 1$. Not only is it interesting to know if the probability estimates for the largest and smallest singular values of random matrices in the $\ell_2$-norm can be extended to the setting of the $\ell_q$-quasinorm, there are also some definite advantages of using the general $\ell_q$-quasinorm when studying the restricted isometry property of random matrices as suggested in Chartrand and Staneva’2008, [4], Foucart and Lai’2009, [8] and Foucart and Lai’2010, [9]. In addition to Gaussian and sub-Gaussian random matrices, we would like to study the probability estimates for pre-Gaussian random matrices. A random variable $\xi$ is pre-Gaussian if $\xi$ has mean zero and satisfies the moment growth condition $E(|\xi|^k) \le k!\,\lambda^k/2$, i.e. $(E(|\xi|^k))^{1/k} \le C\lambda k$ for $k \ge 1$ (cf. Buldygin and Kozachenko’2000, [3]). Note that the moment growth condition for a sub-Gaussian random variable $\eta$ is $(E(|\eta|^k))^{1/k} \le BC\sqrt{k}$.

To be precise on what we are going to study in this paper, for any vector $x = (x_1, \ldots, x_n)^T$ in $\mathbb{R}^n$, let

$$\|x\|_q^q = \sum_{i=1}^{n} |x_i|^q$$

for $q \in (0, \infty)$. It is known that for $q \ge 1$, $\|\cdot\|_q$ is a norm for $\mathbb{R}^n$, and $\|\cdot\|_q$ is a quasinorm for $\mathbb{R}^n$ for $q \in (0, 1)$ that satisfies all the properties of a norm except the triangle inequality. Let $A = (a_{ij})_{1 \le i \le m,\, 1 \le j \le n}$ be a matrix. The standard largest q-singular value is defined by

(1.1) $s_1^{(q)}(A) := \sup\left\{ \dfrac{\|Ax\|_q}{\|x\|_q} : x \in \mathbb{R}^n \text{ with } x \ne 0 \right\}.$

It is known that for $q \ge 1$, the equation in (1.1) defines a norm on the space of $m \times n$ matrices. In addition, we know

(1.2) $\max_j \|a_j\|_q \le s_1^{(q)}(A) \le n^{\frac{q-1}{q}} \max_j \|a_j\|_q,$

where $a_j$, $j = 1, 2, \ldots, n$, are the column vectors of $A$. We refer to any book on matrix theory for the properties of the largest singular value $s_1^{(q)}(A)$ when $q \ge 1$, for example, [11]. However, for $q \in (0, 1)$, the properties of $s_1^{(q)}(A)$ are not well known. For convenience, we shall explain some useful properties in the Preliminaries section.
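The difference between the norm case and the quasinorm case can be seen on a two-dimensional example. The sketch below (function names are ours and purely illustrative) evaluates the $\ell_q$-quasinorm and the two gaps that measure whether the triangle inequality holds for $\|\cdot\|_q$ and for its q-th power.

```python
def lq_quasinorm(x, q):
    """(sum |x_i|^q)^(1/q); a norm for q >= 1 and a quasinorm for 0 < q < 1."""
    return sum(abs(t) ** q for t in x) ** (1.0 / q)


def triangle_gap(x, y, q):
    """||x + y||_q - (||x||_q + ||y||_q); a positive value violates the triangle inequality."""
    s = [a + b for a, b in zip(x, y)]
    return lq_quasinorm(s, q) - (lq_quasinorm(x, q) + lq_quasinorm(y, q))


def q_power_gap(x, y, q):
    """||x + y||_q^q - (||x||_q^q + ||y||_q^q); never positive when 0 < q <= 1."""
    s = [a + b for a, b in zip(x, y)]
    return (sum(abs(t) ** q for t in s)
            - sum(abs(t) ** q for t in x)
            - sum(abs(t) ** q for t in y))
```

With x = (1, 0), y = (0, 1) and q = 1/2, `triangle_gap` is positive, while `q_power_gap` is zero: the q-th power of the quasinorm is subadditive even though the quasinorm itself is not.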

The purpose of this paper is to study the matrix spectrum, e.g. $s_1^{(q)}(A)$, for a random matrix $A$ with pre-Gaussian entries. Two sets of our main results are the following.

Theorem 1.3 (Upper tail probability of the largest q-singular value). Let $\xi$ be a pre-Gaussian variable normalized to have variance 1 and $A$ be an $m \times m$ matrix with i.i.d. copies of $\xi$ in its entries. Then for any $0 < q \le 1$,

(1.3) $P\left(s_1^{(q)}(A) \ge C m^{1/q}\right) \le \exp(-C'm)$

for some $C, C' > 0$ only dependent on the pre-Gaussian variable $\xi$.

Theorem 1.4 (Lower tail probability of the largest q-singular value). Let $\xi$ be a pre-Gaussian variable normalized to have variance 1 and $A$ be an $m \times m$ matrix with i.i.d. copies of $\xi$ in its entries. Then there exists a constant $K > 0$ such that

(1.4) $P\left(s_1^{(q)}(A) \le K m^{1/q}\right) \le c^m$

for some $0 < c < 1$, where $K$ only depends on the pre-Gaussian variable $\xi$.

These results have their counterparts in papers by Yin, Bai, Krishnaiah’1988, [31], Bai, Silverstein and Yin’1988, [1] and Soshnikov’2002, [20] for the $\ell_2$-norm. It is interesting to know if the above results hold for general random matrices whose entries are i.i.d. copies of a random variable with bounded fourth moment.
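The $m^{1/q}$ scaling in Theorems 1.3 and 1.4 can be observed numerically. The sketch below is only illustrative: it relies on the identity $s_1^{(q)}(A) = \max_j \|a_j\|_q$ for $0 < q \le 1$ (equation (2.1) of the Preliminaries), and it assumes standard Gaussian entries or random signs as convenient stand-ins for a pre-Gaussian variable. The function names are not from the paper.

```python
import random


def largest_q_singular_value(A, q):
    """max_j ||a_j||_q over the columns a_j; valid as s_1^(q)(A) only for 0 < q <= 1."""
    if not 0.0 < q <= 1.0:
        raise ValueError("the column formula is used here only for 0 < q <= 1")
    n_cols = len(A[0])
    return max(sum(abs(row[j]) ** q for row in A) ** (1.0 / q)
               for j in range(n_cols))


def gaussian_entry(rng):
    """Standard Gaussian entry (assumed example of an admissible variable)."""
    return rng.gauss(0.0, 1.0)


def sign_entry(rng):
    """Random sign, +1 or -1 with equal probability."""
    return rng.choice((-1.0, 1.0))


def scaled_largest_value(m, q, sampler, seed=0):
    """s_1^(q)(A) / m^(1/q) for one random m x m matrix."""
    rng = random.Random(seed)
    A = [[sampler(rng) for _ in range(m)] for _ in range(m)]
    return largest_q_singular_value(A, q) / m ** (1.0 / q)
```

For random signs every column has quasinorm exactly $m^{1/q}$, so the scaled value equals 1. For Gaussian entries the scaled value fluctuates but stays within a fixed range as m grows, which is what the two tail bounds describe.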

Next we would like to study the smallest singular values. In general we can define the k-th q-singular value as follows.

Definition 1.1. The k-th q-singular value of an $m \times n$ matrix $A$ is defined by

(1.5) $s_k^{(q)}(A) := \inf\left\{ \sup\left\{ \dfrac{\|Ax\|_q}{\|x\|_q} : x \in V \setminus \{0\} \right\} : V \subseteq \mathbb{R}^n,\ \dim(V) \ge n - k + 1 \right\}.$

It is easy to see that

(1.6) $s_1^{(q)}(A) \ge s_2^{(q)}(A) \ge \cdots \ge s_{\min(m,n)}^{(q)}(A) \ge 0.$

The smallest singular value $s_{\min(m,n)}^{(q)}(A)$ is also of special interest in various studies.

In the lower tail probability estimate, we divide the study into two cases, $m > n$ (tall matrices) and $m = n$ (square matrices), under the assumption that $A$ is of full rank. The study is heavily dependent on the known results on compressible and incompressible vectors. In the upper tail probability estimate, we use the known estimates on the projection in the $\ell_2$-norm. Another set of main results is as follows. For tall random matrices, we have

Theorem 1.5 (Lower tail probability on the smallest q-singular value). Let us fix $0 < q \le 1$. Let $\xi$ be a pre-Gaussian random variable with mean 0 and variance 1. Suppose that $A$ is an $m \times n$ matrix with i.i.d. copies of $\xi$ in its entries with $m > n$. Then there exist some $\varepsilon > 0$, $c > 0$ and $\lambda \in (0, 1)$ dependent on $q$ and $\xi$ such that

(1.7) $P\left(s_n^{(q)}(A) \le \varepsilon m^{1/q}\right) \le e^{-cm}$

when $n \le \lambda m$.


For square random matrices, we have

Theorem 1.6 (Lower tail probability on the smallest q-singular value). Let us fix $0 < q \le 1$. Let $\xi$ be a pre-Gaussian random variable with variance 1 and $A$ be an $n \times n$ matrix with i.i.d. copies of $\xi$ in its entries. Then for any $\varepsilon > 0$, one has

(1.8) $P\left(s_n^{(q)}(A) \le \gamma n^{-1/q}\right) \le \varepsilon,$

where $\gamma > 0$ depends only on the pre-Gaussian variable $\xi$.

The  above  theorem  is  an  extension  of  Theorem  1.1.  Finally  we  have 

Theorem 1.7 (Upper tail probability on the smallest q-singular value). Given any $0 < q \le 1$, let $\xi$ be a pre-Gaussian random variable with variance 1 and $A$ be an $n \times n$ matrix with i.i.d. copies of $\xi$ in its entries. Then for any $K > e$, there exist some $C > 0$, $0 < c < 1$, and $\alpha > 0$ which are only dependent on the pre-Gaussian variable $\xi$ such that

(1.9)  P  (sM(A)  >  An"1/2)  <  +  c". 

In particular, for any $\varepsilon > 0$, there exist some $K > 0$ and $n_0$ such that

(1.10) $P\left(s_n^{(q)}(A) \ge K n^{-1/2}\right) \le \varepsilon$

for all $n \ge n_0$.

The above theorem is an extension of Theorem 1.2. Note that we are not able to prove

(1.11) $P\left(s_n^{(q)}(A) \ge K n^{-1/q}\right) \le \varepsilon$

under the assumptions in Theorem 1.7. However, we strongly believe that the above inequality holds. We leave it as a conjecture.

The remainder of the paper is devoted to the proof of these five theorems, which give a good understanding of the spectrum of pre-Gaussian random matrices in the $\ell_q$-quasinorm with $0 < q \le 1$. We shall present the analysis in four separate sections after the Preliminaries section.


2.  Preliminaries 

First of all, one can easily derive the following

Lemma 2.1. For $0 < q \le 1$, the equation in (1.1) defines a quasinorm on the space of $m \times N$ matrices. In particular, we have

$$\left(s_1^{(q)}(A + B)\right)^q \le \left(s_1^{(q)}(A)\right)^q + \left(s_1^{(q)}(B)\right)^q$$

for any $m \times N$ matrices $A$ and $B$. Moreover,

(2.1) $s_1^{(q)}(A) = \max_j \|a_j\|_q$

for $0 < q \le 1$, where $a_j$, $j = 1, \ldots, N$, are the columns of the matrix $A$.

Proof. It is straightforward to show that $s_1^{(q)}(A)$, $q \le 1$, defines a quasinorm on matrices by using the quasinorm properties of $\|x\|_q$, the $\ell_q$-quasinorm on vectors.

To prove equation (2.1), on one hand, we have

(2.2) $\|Ax\|_q^q \le \sum_{j=1}^{N} |x_j|^q \, \|a_j\|_q^q \le \|x\|_q^q \max_j \|a_j\|_q^q$

for $0 < q \le 1$, which implies

(2.3) $s_1^{(q)}(A) \le \max_j \|a_j\|_q.$

On the other hand, by (1.1), we have

(2.4) $s_1^{(q)}(A) = \sup_{x \in \mathbb{R}^N,\, \|x\|_q = 1} \|Ax\|_q \ge \|Ae_j\|_q = \|a_j\|_q$

for every $j$, where $e_j$ is the $j$-th standard basis vector of $\mathbb{R}^N$, and then it follows that

(2.5) $s_1^{(q)}(A) \ge \max_j \|a_j\|_q.$

Thus, combined with (2.3), we obtain the equation (2.1) for $0 < q \le 1$ as desired. □

Next we need the following elementary estimate. Mainly we need a linear bound for partial binomial expansion.

Lemma 2.2 (Linear bound for partial binomial expansion). For every positive integer $n$,

$$\sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} x^k (1 - x)^{n-k} \le 8x$$

for all $x \in [0, 1]$.

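Proof. Let us start with an even integer $2n$. For every $x \in [\frac{1}{8}, 1]$, we have

(2.6) $\sum_{k=n+1}^{2n} \binom{2n}{k} x^k (1-x)^{2n-k} \le \sum_{k=0}^{2n} \binom{2n}{k} x^k (1-x)^{2n-k} = 1 \le 8x.$

But for $x \in [0, \frac{1}{8})$, we let

$$f(x) := \sum_{k=n+1}^{2n} \binom{2n}{k} x^k (1-x)^{2n-k}.$$

By De Moivre-Stirling’s formula (see e.g. [7]) and furthermore the estimate in [13],

$$n! = \sqrt{2\pi n} \left(\frac{n}{e}\right)^n e^{\lambda_n},$$

where $\frac{1}{12n+1} < \lambda_n < \frac{1}{12n}$. We have

(2.7) $\binom{2n}{n} = \dfrac{\sqrt{2\pi \cdot 2n} \left(\frac{2n}{e}\right)^{2n} e^{\lambda_{2n}}}{\left(\sqrt{2\pi n} \left(\frac{n}{e}\right)^n e^{\lambda_n}\right)^2} \le \dfrac{4^n}{\sqrt{\pi n}}.$

Since $\binom{2n}{k} \le \binom{2n}{n}$ for $n + 1 \le k \le 2n$,

(2.8) $f(x) \le \binom{2n}{n} \sum_{k=n+1}^{2n} x^k (1-x)^{2n-k} \le \binom{2n}{n} \sum_{k=n+1}^{2n} x^k \le n \binom{2n}{n} x^{n+1}$

for all $x \in [0, 1]$. Using (2.7), we have

(2.9) $f(x) \le 4^n \sqrt{\dfrac{n}{\pi}}\, x^{n+1}.$

Letting $g(x) = 4^n \sqrt{\frac{n}{\pi}}\, x^n$, we have

$$\ln(g(x)) = n \ln(4x) + \frac{1}{2} \ln n - \frac{1}{2} \ln \pi \le -n \ln 2 + \frac{1}{2} \ln n - \frac{1}{2} \ln \pi < 0$$

for $x \in [0, 1/8]$. Thus we have $f(x) \le x \le 8x$. Also, we can have a similar estimate for odd integers. These complete the proof. □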
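Both the statement of Lemma 2.2 and the central binomial bound (2.7) used in its proof are easy to check numerically on a grid. The following sketch does so with exact integer binomial coefficients from the standard library. The function names, the grid and the floating-point tolerance are our own choices.

```python
import math


def upper_half_binomial_sum(n, x):
    """Sum over k = floor(n/2) + 1, ..., n of C(n, k) x^k (1 - x)^(n - k)."""
    return sum(math.comb(n, k) * x ** k * (1.0 - x) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


def lemma_bound_holds(n, x, tol=1e-12):
    """Check the linear bound of Lemma 2.2 at a single point (n, x)."""
    return upper_half_binomial_sum(n, x) <= 8.0 * x + tol


def central_binomial_bound_holds(n):
    """Check C(2n, n) <= 4^n / sqrt(pi n), the estimate labelled (2.7)."""
    return math.comb(2 * n, n) <= 4 ** n / math.sqrt(math.pi * n)


def check_grid(max_n=40, steps=50):
    """True if the lemma holds for all n <= max_n on a uniform grid in [0, 1]."""
    return all(lemma_bound_holds(n, i / steps)
               for n in range(1, max_n + 1)
               for i in range(steps + 1))
```

Such a check does not replace the proof, but it is a quick guard against transcription errors in the constants.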

Remark  2.1.  The  coefficient  on  the  right-hand  side  can  be  improved  by  Markov’s 
inequality,  but  the  estimate  obtained  by  the  analytic  technique  above  is  actually 
good  enough  for  the  purposes  of  this  paper. 

Next we review the smallest q-singular values. Without loss of generality, we consider $m \ge n$. Then the $n$-th q-singular value is the smallest q-singular value, which can also be expressed in another way.

Lemma 2.3. Let $A$ be an $m \times n$ matrix with $m \ge n$. Then the smallest q-singular value

(2.10) $s_n^{(q)}(A) = \inf\left\{ \dfrac{\|Ax\|_q}{\|x\|_q} : x \in \mathbb{R}^n \text{ with } x \ne 0 \right\}.$

Proof. By the definition,

(2.11)
$$s_n^{(q)}(A) = \inf\left\{ \sup\left\{ \frac{\|Ax\|_q}{\|x\|_q} : x \in V \setminus \{0\} \right\} : V \subseteq \mathbb{R}^n,\ \dim(V) \ge 1 \right\}$$
$$\le \inf\left\{ \sup\left\{ \frac{\|Av\|_q}{\|v\|_q} : v \in V \setminus \{0\} \right\} : V = \mathrm{span}(x),\ x \in \mathbb{R}^n \setminus \{0\} \right\}$$
$$= \inf\left\{ \frac{\|Ax\|_q}{\|x\|_q} : x \in \mathbb{R}^n \text{ with } x \ne 0 \right\}.$$

We also know the infimum can be achieved by considering the unit $\ell_q$-sphere in the finite-dimensional space, and so the claim follows. □

In particular, if $A$ is an $n \times n$ matrix, we know

(2.12) $s_n^{(q)}(A) = \inf\left\{ \dfrac{\|Ax\|_q}{\|x\|_q} : x \in \mathbb{R}^n \text{ with } x \ne 0 \right\} = \dfrac{1}{\sup\left\{ \dfrac{\|A^{-1}x\|_q}{\|x\|_q} : x \in \mathbb{R}^n \text{ with } x \ne 0 \right\}}.$

The estimate of the largest q-singular value can be used to estimate the smallest q-singular values based on this relation.

As we see, the q-singular value is defined by the $\ell_q$-quasinorm, as opposed to the $\ell_2$-norm, but using a similar proof for the relationship between the rank of a matrix and its smallest singular value in $\ell_2$, one has the following relationship between the rank of a matrix and its smallest q-singular value.


Lemma 2.4. For any positive integers $m$ and $n$, an $m \times n$ matrix $A$ is of full rank if and only if $s_{\min(m,n)}^{(q)}(A) > 0$.

Remark 1. One could also derive this lemma by the properties of singular values defined by the $\ell_2$-norm and by using the inequalities on the relations between the $\ell_2$-norm and the $\ell_q$-quasinorm.

We shall need the following result to estimate the smallest q-singular values.


Lemma  2.5.  Let  A  be  a  matrix  of  size  mx  N.  Suppose  that  m  >  N.  Then 

s^9}  ,  ,n(A)  <  min  ||a,-||  . 

min(ra, iV)  v  '  —  j  w  J\\q 

Proof.  Choose  ej0  to  be  a  standard  basis  vector  of  such  that  ||AeJO||q  = 
min,-  ||a,-||  and  use  the  definition  of  s ^  ,  ^r^(A)  for  m  >  N.  □ 

J  11  JWq  min(ra,iV)\  /  — 


The  following  generalization  of  Lemma  4.10  in  Pisier’1999,  [12]  will  be  used  in  a 
later  section. 

Lemma 2.6. For $0 < q \le 1$, let $S_q := \{x \in \mathbb{R}^n : \|x\|_q = 1\}$ denote the unit sphere of $\mathbb{R}^n$ in the $\ell_q$-quasinorm. For any $\delta > 0$, there exists a finite set $U_q \subset S_q$ with

$\min_{u \in U_q} \|x - u\|_q^q \le \delta$ for all $x \in S_q$

and

$\mathrm{card}(U_q) \le \left(1 + \frac{2}{\delta}\right)^{n/q}.$


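Proof. Let $(u_1, \ldots, u_k)$ be a set of $k$ points on the sphere $S_q$ such that $\|u_i - u_j\|_q^q > \delta$ for all $i \ne j$. We choose $k$ as large as possible. Thus, it is clear that

$\min_{1 \le i \le k} \|x - u_i\|_q^q \le \delta$ for all $x \in S_q$.

Let $B_q := \{x \in \mathbb{R}^n : \|x\|_q \le 1\}$ be the unit ball of $\mathbb{R}^n$ relative to the quasinorm $\|\cdot\|_q$. It is easy to see that the $(\delta/2)$-balls centered at $u_i$,

$u_i + \left(\frac{\delta}{2}\right)^{1/q} B_q, \quad 1 \le i \le k,$

are disjoint. Indeed, if $x$ would belong to the $(\delta/2)$-ball centered at $u_i$ as well as the $(\delta/2)$-ball centered at $u_j$, we would have

$\|u_i - u_j\|_q^q \le \|u_i - x\|_q^q + \|u_j - x\|_q^q \le \frac{\delta}{2} + \frac{\delta}{2} = \delta,$

which is a contradiction. Besides, it is easy to see that

$u_i + \left(\frac{\delta}{2}\right)^{1/q} B_q \subset \left(1 + \frac{\delta}{2}\right)^{1/q} B_q, \quad 1 \le i \le k.$

By comparison of volumes, we get

$k \, \mathrm{Vol}\left(\left(\frac{\delta}{2}\right)^{1/q} B_q\right) = \sum_{i=1}^{k} \mathrm{Vol}\left(u_i + \left(\frac{\delta}{2}\right)^{1/q} B_q\right) \le \mathrm{Vol}\left(\left(1 + \frac{\delta}{2}\right)^{1/q} B_q\right).$

Then, by homogeneity of the volumes, we have

$k \left(\frac{\delta}{2}\right)^{n/q} \mathrm{Vol}(B_q) \le \left(1 + \frac{\delta}{2}\right)^{n/q} \mathrm{Vol}(B_q),$

which implies that $k \le \left(1 + \frac{2}{\delta}\right)^{n/q}$. This completes the proof. □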
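
The construction in the proof of Lemma 2.6 can be mimicked numerically. The following is a minimal sketch, not part of the original argument: it assumes NumPy, draws sample points on the unit $\ell_q$ sphere by normalizing Gaussian vectors (an arbitrary choice), and greedily keeps points that are pairwise more than $\delta$ apart in the distance $\|x - u\|_q^q$. The function names are hypothetical.

```python
import numpy as np


def lq_power(x, q):
    """Return ||x||_q^q = sum_i |x_i|^q along the last axis."""
    return np.sum(np.abs(x) ** q, axis=-1)


def sample_lq_sphere(num_points, n, q, seed=0):
    """Draw points on the unit l_q sphere by normalizing Gaussian vectors.

    The sampling distribution is an arbitrary choice for this illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_points, n))
    return x / (lq_power(x, q) ** (1.0 / q))[:, None]


def greedy_separated_set(points, q, delta):
    """Keep a point only if ||p - u||_q^q > delta for every point u kept so far."""
    kept = []
    for p in points:
        if all(lq_power(p - u, q) > delta for u in kept):
            kept.append(p)
    return np.array(kept)


def net_cardinality_bound(n, q, delta):
    """The bound (1 + 2/delta)^(n/q) stated in Lemma 2.6."""
    return (1.0 + 2.0 / delta) ** (n / q)


points = sample_lq_sphere(2000, n=2, q=0.5)
net = greedy_separated_set(points, q=0.5, delta=1.0)
print(len(net), "<=", net_cardinality_bound(2, 0.5, 1.0))
```

By construction every sampled point lies within $\delta$ of some kept point, and the kept points are $\delta$-separated, so their number should not exceed the bound of the lemma.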


3. The upper tail probability of the largest q-singular value
We  begin  with  the  following 

Theorem 3.1 (Upper tail probability of the largest 1-singular value). Let $\xi$ be a pre-Gaussian variable normalized to have variance 1 and $A$ be an $m \times m$ matrix with i.i.d. copies of $\xi$ in its entries. Then

(3.1) $P\left(s_1^{(1)}(A) \ge Cm\right) \le \exp(-C'm)$

for some $C, C' > 0$ only dependent on the pre-Gaussian variable $\xi$.

Proof. Since $a_{ij}$ are i.i.d. copies of the pre-Gaussian variable $\xi$, $E a_{ij} = 0$, and there exists some $\lambda > 0$ such that $E|a_{ij}|^k \le k! \lambda^k$ for all $k$. Without loss of generality, we may assume that $\lambda \ge 1$. With the variance $E a_{ij}^2 = 1$, we have

$E|a_{ij}|^k \le \frac{1}{2} H^{k-2} k!$

for $H := 2\lambda^3$ and all $k \ge 2$. By the Bernstein inequality (cf. Theorem 5.2 in [?]), we know that

$P\left(\left|\sum_{i=1}^{m} a_{ij}\right| \ge t\right) \le 2\exp\left(-\frac{t^2}{2(m + tH)}\right) = 2\exp\left(-\frac{t^2}{2(m + 2t\lambda^3)}\right)$

for all $t > 0$ and for each $j = 1, \ldots, m$. In particular, when $t = Cm$,

$P\left(\left|\sum_{i=1}^{m} a_{ij}\right| \ge Cm\right) \le 2\exp\left(-\frac{C^2 m}{4C\lambda^3 + 2}\right).$

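Here a condition on $C$ will be determined later.
On the other hand, by Lemma 2.1,

$s_1^{(1)}(A) = \max_j \|a_j\|_1 = \sum_{i=1}^{m} |a_{ij_0}|$

for some $j_0$. Furthermore, for any $t > 0$, by the probability of the union,

(3.3) $P\left(\sum_{i=1}^{m} |a_{ij}| \ge t\right) \le \sum_{(\epsilon_1, \ldots, \epsilon_m) \in \{-1,1\}^m} P\left(\sum_{i=1}^{m} \epsilon_i a_{ij} \ge t\right).$

But $-a_{ij}$ has the same pre-Gaussian properties as $a_{ij}$, precisely, $E(-a_{ij}) = 0$ and $E|-a_{ij}|^k \le k! \lambda^k$. Thus we have

(3.4) $P\left(s_1^{(1)}(A) \ge Cm\right) \le m \, P\left(\sum_{i=1}^{m} |a_{ij}| \ge Cm\right)$
$\le 2^m m \, P\left(\sum_{i=1}^{m} a_{ij} \ge Cm\right)$
$\le 2^{m+1} m \exp\left(-\frac{C^2 m}{4C\lambda^3 + 2}\right)$
$\le \exp\left(-\left(\frac{C^2}{4C\lambda^3 + 2} - \ln 2 - 1\right) m\right).$

To obtain an exponential decay for the probability $P\left(s_1^{(1)}(A) \ge Cm\right)$, we require that $\frac{C^2}{4C\lambda^3 + 2} - \ln 2 - 1 > 0$, for which

(3.5) $C > 2\lambda^3 + 2\lambda^3 \ln 2 + \sqrt{2 + 2\ln 2 + 4\lambda^6 + 8\lambda^6 \ln 2 + 4\lambda^6 \ln^2 2}.$

That is, choosing $C' = \frac{C^2}{4C\lambda^3 + 2} - \ln 2 - 1$, we get (3.1). □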
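
As a quick numerical check of the arithmetic behind (3.5), the sketch below evaluates the threshold on $C$ and the resulting decay rate $C'$ for a given moment constant $\lambda$. It is only an illustration: the function names are hypothetical, and the formulas are the ones read off from the proof above, with the threshold written as the positive root of $C^2 - 4\lambda^3(1 + \ln 2)C - 2(1 + \ln 2) = 0$.

```python
import math


def c_threshold(lam):
    """Right-hand side of (3.5): the smallest C for which C' becomes positive."""
    a = 1.0 + math.log(2.0)
    # Positive root of C^2 - 4*lam^3*a*C - 2*a = 0.
    return 2.0 * lam**3 * a + math.sqrt(4.0 * lam**6 * a**2 + 2.0 * a)


def c_prime(C, lam):
    """Decay rate C' = C^2 / (4*C*lam^3 + 2) - ln 2 - 1 from the proof."""
    return C**2 / (4.0 * C * lam**3 + 2.0) - math.log(2.0) - 1.0


lam = 1.0
C0 = c_threshold(lam)
print("threshold on C:", C0)
print("C' at threshold:", c_prime(C0, lam))       # numerically close to zero
print("C' slightly above:", c_prime(C0 + 1.0, lam))  # positive
```

Any $C$ above the printed threshold gives a positive rate, which is all that the proof needs.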

The previous theorem allows us to estimate the largest $q$-singular value for $0 < q \le 1$. The estimate follows easily from Theorem 3.1, but it is one of the tail probabilistic estimates we wanted to obtain, so let us state it as a theorem, which is Theorem 1.3.

Proof of Theorem 1.3. By Hölder's inequality, we have $\|a_j\|_q \le m^{\frac{1}{q} - 1} \|a_j\|_1$ for $0 < q \le 1$. It follows from Lemma 2.1 that

(3.6) $s_1^{(q)}(A) = \max_j \|a_j\|_q \le m^{\frac{1}{q} - 1} s_1^{(1)}(A).$

From (3.1), we have

(3.7) $P\left(s_1^{(q)}(A) \ge C m^{1/q}\right) \le P\left(m^{\frac{1}{q} - 1} s_1^{(1)}(A) \ge C m^{1/q}\right)$
$= P\left(s_1^{(1)}(A) \ge Cm\right)$
$\le \exp(-C'm)$

for some $C, C' > 0$. □
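
The identity $s_1^{(q)}(A) = \max_j \|a_j\|_q$ and the Hölder bound in (3.6) are easy to check numerically. The following sketch assumes NumPy and a Gaussian test matrix (one particular pre-Gaussian choice); the function names are hypothetical.

```python
import numpy as np


def lq_quasinorm(x, q):
    """Return ||x||_q = (sum_i |x_i|^q)^(1/q) for 0 < q <= 1."""
    return float(np.sum(np.abs(x) ** q) ** (1.0 / q))


def largest_q_singular_value(A, q):
    """Largest q-singular value via the column formula max_j ||a_j||_q."""
    return max(lq_quasinorm(A[:, j], q) for j in range(A.shape[1]))


rng = np.random.default_rng(0)
m, q = 50, 0.5
A = rng.standard_normal((m, m))

s_q = largest_q_singular_value(A, q)
s_1 = largest_q_singular_value(A, 1.0)

# Hoelder bound (3.6): s_1^(q)(A) <= m^(1/q - 1) * s_1^(1)(A).
print(s_q <= m ** (1.0 / q - 1.0) * s_1)

# The ratio ||Ax||_q / ||x||_q never exceeds the column maximum.
x = rng.standard_normal(m)
print(lq_quasinorm(A @ x, q) / lq_quasinorm(x, q) <= s_q)
```

Both comparisons should print `True`; the second one reflects that the supremum defining the largest $q$-singular value is attained at a standard basis vector.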


4. The lower tail probability of the largest q-singular value

Let us use the result in Lemma 2.2 to give estimates on the lower tail probabilities of the largest $q$-singular value.


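Lemma 4.1. Suppose $\xi_1, \ldots, \xi_n$ are i.i.d. copies of a random variable $\xi$. Then for any $\varepsilon > 0$,

(4.1) $P\left(\sum_{i=1}^{n} |\xi_i| \le \frac{n\varepsilon}{2}\right) \le 8 P(|\xi| \le \varepsilon).$

Proof. First, we have the relation on the probability events that

(4.2) $\left\{(\xi_1, \ldots, \xi_n) : \sum_{i=1}^{n} |\xi_i| \le \frac{n\varepsilon}{2}\right\}$

is contained in

(4.3) $\bigcup_{k=\lfloor n/2 \rfloor + 1}^{n} \ \bigcup_{\{i_1, \ldots, i_k\} \subset \{1, \ldots, n\}} \left\{(\xi_1, \ldots, \xi_n) : |\xi_{i_1}| \le \varepsilon, \ldots, |\xi_{i_k}| \le \varepsilon, |\xi_{i_{k+1}}| > \varepsilon, \ldots, |\xi_{i_n}| > \varepsilon\right\},$

where $\{i_1, i_2, \ldots, i_k\}$ is a subset of $\{1, 2, \ldots, n\}$ and $\{i_{k+1}, \ldots, i_n\}$ is its complement, and let us denote the set (4.3) by $\mathcal{E}$.

Let $x = P(|\xi_1| \le \varepsilon)$. Then by the union probability,

(4.4) $P(\mathcal{E}) = \sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} x^k (1 - x)^{n-k},$

and applying Lemma 2.2, we have

(4.5) $P(\mathcal{E}) \le 8x = 8 P(|\xi_1| \le \varepsilon).$

Since the event (4.2) is contained in the event (4.3), we have

(4.6) $P\left(\sum_{i=1}^{n} |\xi_i| \le \frac{n\varepsilon}{2}\right) \le P(\mathcal{E}) \le 8 P(|\xi_1| \le \varepsilon).$

□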
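
The binomial sum in (4.4) can be evaluated exactly for small $n$, which gives a sanity check of the bound (4.5). The sketch below uses only the Python standard library; the function name is hypothetical and the grid of test values is arbitrary.

```python
from math import comb


def majority_tail(n, x):
    """Sum over k = floor(n/2)+1, ..., n of C(n, k) x^k (1 - x)^(n - k).

    This is the probability that more than half of n independent events,
    each of probability x, occur; it is the sum appearing in (4.4).
    """
    return sum(comb(n, k) * x**k * (1.0 - x) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


for n in (1, 5, 20, 100):
    for x in (0.01, 0.05, 0.1):
        # Compare the exact tail with the bound 8x used in (4.5).
        print(n, x, majority_tail(n, x), 8 * x)
```

In every row the exact tail stays below $8x$. This agrees with a direct Markov estimate, since the expected number of successes is $nx$ and the sum starts above $n/2$.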


We start with a lower tail probability for the 1-singular values.

Theorem 4.1 (Lower tail probability of the largest 1-singular value). Let $\xi$ be a pre-Gaussian variable normalized to have variance 1 and $A$ be an $m \times m$ matrix with i.i.d. copies of $\xi$ in its entries. Then there exists a constant $K > 0$ such that

(4.7) $P\left(s_1^{(1)}(A) \le Km\right) \le c^m$

for some $0 < c < 1$, where $K$ only depends on the pre-Gaussian variable $\xi$.


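Proof. Since $a_{ij}$ has variance 1, there exist $\delta > 0$ and $0 < \beta < 1$ such that

(4.8) $P(|a_{ij}| \le \delta) = \beta.$

Let $B_j$ be the number of variables in $\{|a_{ij}|\}_{i=1}^{m}$ that are less than or equal to $\delta$. Then if $\sum_{i=1}^{m} |a_{ij}| \le \delta \cdot \lambda m$ for $0 < \lambda < 1$, then $B_j \ge (1 - \lambda) m$, because otherwise $\sum_{i=1}^{m} |a_{ij}| > \delta \cdot \lambda m$. It follows that

(4.9) $P\left(\sum_{i=1}^{m} |a_{ij}| \le \delta \cdot \lambda m\right) \le P(B_j \ge (1 - \lambda) m).$

By Markov's inequality,

(4.10) $P(B_j \ge (1 - \lambda) m) \le \frac{E B_j}{(1 - \lambda) m},$

but $B_j$ satisfies a binomial distribution of $m$ independent experiments, each of which yields success with probability $\beta$; therefore

(4.11) $P(B_j \ge (1 - \lambda) m) \le \frac{\beta}{1 - \lambda}.$

By choosing suitable $\lambda$, we can make $0 < \frac{\beta}{1 - \lambda} < 1$. Thus

(4.12) $P\left(\sum_{i=1}^{m} |a_{ij}| \le \delta \cdot \lambda m\right) \le c$

for some $0 < c < 1$. It follows that

(4.13) $P\left(s_1^{(1)}(A) \le \lambda\delta m\right) = P\left(\max_{1 \le j \le m} \sum_{i=1}^{m} |a_{ij}| \le \lambda\delta m\right)$
$= \prod_{j=1}^{m} P\left(\sum_{i=1}^{m} |a_{ij}| \le \lambda\delta m\right)$
$\le c^m.$

Thus letting $K = \lambda\delta$, we obtain (4.7).

□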


For general $0 < q \le 1$, we have Theorem 1.4.

Proof of Theorem 1.4. We can use the same method as in the proof of Theorem 4.1. Since $a_{ij}$ has nonzero variance, there exist $\delta > 0$ and $0 < \beta < 1$ such that

(4.14) $P(|a_{ij}|^q \le \delta) = \beta.$

Then by Lemma 4.1 and substituting $a_{ij}$ in the proof of Theorem 4.1 by $|a_{ij}|^q$,

(4.15) $P\left(s_1^{(q)}(A) \le (\lambda\delta)^{1/q} m^{1/q}\right) = P\left(\max_{1 \le j \le m} \sum_{i=1}^{m} |a_{ij}|^q \le \lambda\delta m\right)$
$= \prod_{j=1}^{m} P\left(\sum_{i=1}^{m} |a_{ij}|^q \le \lambda\delta m\right)$
$\le c^m$

for some $0 < c < 1$. Thus letting $K = (\lambda\delta)^{1/q}$, (1.4) follows. □

Remark 2. If one uses the quasinorm comparison inequality $s_1^{(1)}(A) \le s_1^{(q)}(A)$ for $0 < q \le 1$, one can get

(4.16) $P\left(s_1^{(q)}(A) \le Km\right) \le c^m$

for $0 < q \le 1$, but with a loss of the estimate on $P\left(s_1^{(q)}(A) \le K m^{1/q}\right)$.

Since the bounded moment growth condition for pre-Gaussian variables is not needed in the proof of Theorem 4.1, the above proofs also show that the theorem holds for any random variable with nonzero variance. Therefore, more generally, we have

Theorem 4.2. Let $\xi$ be a random variable with non-zero variance and $A$ be an $m \times m$ matrix with i.i.d. copies of $\xi$ in its entries. Then there exists a constant $K > 0$ such that

(4.17) $P\left(s_1^{(q)}(A) \le K m^{1/q}\right) \le c^m$

for some $0 < c < 1$, where $K$ only depends on $q$ and the random variable $\xi$.


5. The lower tail probability of the smallest q-singular value

In this section, we first study the probability estimates of the smallest $q$-singular value of rectangular random matrices with $m > n$. Then we give some estimates for square random matrices.


5.1. The tall random matrix case. In this subsection, we assume that $n \le \lambda m$ with $\lambda \in (0, 1)$ and consider the smallest $q$-singular value of random matrices of size $m \times n$.

Theorem 5.1. Given any $0 < q \le 1$, let $\xi$ be the pre-Gaussian random variable with variance 1 and $A$ be an $m \times n$ matrix with i.i.d. copies of $\xi$ in its entries. Then there exist some $\gamma > 0$, $b > 0$ and $\nu \in (0, 1)$ dependent on the pre-Gaussian random variable $\xi$ such that

(5.1) $P\left(s_n^{(q)}(A) < \gamma m^{1/q}\right) < e^{-bm}$

with $n \le \nu m$.

To  prove  this  result,  we  need  to  establish  a  few  lemmas. 

Lemma 5.1. Fix any $0 < q \le 1$. For any $\xi_1, \ldots, \xi_m$ that are i.i.d. copies of a pre-Gaussian variable with non-zero variance, for any $c \in (0, 1)$ there exists $\lambda \in (0, 1)$, that does not depend on $m$, such that

(5.2) $P\left(\sum_{k=1}^{m} |\xi_k|^q < \lambda m\right) \le c^m.$

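Proof. For any $\xi_1, \ldots, \xi_m$ that are i.i.d. copies of a pre-Gaussian variable with non-zero variance, we know that there exists some $\delta > 0$ such that

(5.3) $\varepsilon_0 := P(|\xi_k| < \delta) < 1$

for $k = 1, 2, \ldots, m$, because otherwise the pre-Gaussian variable would have a zero variance. Then using the Riemann-Stieltjes integral for expectation, we have

$E \exp\left(-\frac{|\xi_k|^q}{\lambda}\right) = \int_0^\infty \exp\left(-\frac{t^q}{\lambda}\right) dP(|\xi_k| < t)$
$\le \int_0^\delta dP(|\xi_k| < t) + \int_\delta^\infty \exp\left(-\frac{t^q}{\lambda}\right) dP(|\xi_k| < t)$
$= \varepsilon_0 + \int_\delta^\infty \exp\left(-\frac{t^q}{\lambda}\right) dP(|\xi_k| < t).$

Choose $\lambda > 0$ to be small enough such that

$\exp\left(-\frac{t^q}{\lambda}\right) \le \exp\left(-\frac{\delta^q}{\lambda}\right) \le \frac{\varepsilon_0}{2(1 - \varepsilon_0)}$

for $t \ge \delta$. Therefore, it follows that

$E \exp\left(-\frac{|\xi_k|^q}{\lambda}\right) \le \varepsilon_0 + \frac{\varepsilon_0}{2(1 - \varepsilon_0)} \int_\delta^\infty dP(|\xi_k| < t) = \varepsilon_0 + \frac{\varepsilon_0}{2} = \frac{3\varepsilon_0}{2}.$

Finally, applying Markov's inequality, we obtain

$P\left(\sum_{k=1}^{m} |\xi_k|^q < \lambda m\right) = P\left(\exp\left(m - \frac{1}{\lambda} \sum_{k=1}^{m} |\xi_k|^q\right) > 1\right)$
$\le E \exp\left(m - \frac{1}{\lambda} \sum_{k=1}^{m} |\xi_k|^q\right)$
$= e^m \prod_{k=1}^{m} E \exp\left(-\frac{|\xi_k|^q}{\lambda}\right)$
$\le \left(\frac{3e\varepsilon_0}{2}\right)^m.$

For any $c \in (0, 1)$, we choose $\varepsilon_0$ such that $3e\varepsilon_0/2 = c$. This completes the proof. □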


The following lemma is a property of the linear combination of pre-Gaussian variables, which allows us to obtain the probabilistic estimate on $\|Av\|_q$ for the pre-Gaussian ensemble $A$.


Lemma 5.2 (Linear combination of pre-Gaussian variables). Let $a_{ij}$, $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$ be pre-Gaussian variables and $\eta_i = \sum_{j=1}^{n} a_{ij} x_j$. Then $\eta_i$ are pre-Gaussian variables for $i = 1, 2, \ldots, m$.


Proof. Since $a_{ij}$ are pre-Gaussian variables, $E a_{ij} = 0$, and there are constants $\lambda_{ij} > 0$ such that $E|a_{ij}|^k \le k! \lambda_{ij}^k$ for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$. It is easy to see

$E \eta_i = \sum_{j=1}^{n} x_j E a_{ij} = 0.$

Letting $\|x\|_1 = \sum_{j=1}^{n} |x_j|$, we use the convexity to have

$E|\eta_i|^k \le \|x\|_1^{k-1} \sum_{j=1}^{n} |x_j| \, E|a_{ij}|^k \le k! \, \|x\|_1^k \left(\max_j \{\lambda_{ij}\}\right)^k$

for all integers $k \ge 1$. Thus, $\eta_i$ is a pre-Gaussian random variable. □

Combining  two  lemmas  above,  we  obtain  the  following 

Lemma 5.3. Given any $0 < q \le 1$ and letting $A$ be an $m \times n$ pre-Gaussian matrix, for any $c \in (0, 1)$ there exists $\lambda \in (0, 1)$ such that

(5.4) $P\left(\|Av\|_q < \lambda^{1/q} m^{1/q}\right) \le c^m$

for each $v \in S_q$, where $S_q$ is the $(n - 1)$-dimensional unit sphere in the $\ell_q$-quasinorm.

We  are  now  ready  to  prove  Theorem  5.1. 

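Proof of Theorem 5.1. By using Lemma 2.6, for any $\delta > 0$ there exists a $\delta$-net $U_q$ in the unit sphere $S_q$ such that

$\min_{u \in U_q} \|x - u\|_q^q \le \delta$ for all $x \in S_q$ and $\mathrm{card}(U_q) \le \left(1 + \frac{2}{\delta}\right)^{n/q}.$

By Lemma 5.3 and the union bound over $v \in U_q$, we have

(5.5) $P\left(\|Av\|_q^q < \lambda m \text{ for some } v \in U_q\right) \le \left(1 + \frac{2}{\delta}\right)^{n/q} c^m.$
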

Since 

the 

event  s 

p(49)' 

If  v  G 

Uq, 

we  use 

(5.6) 

If  V 

Uq, 

we  use 

P 

(m»ii. 

< 

e~ClTn  - 

(49)(^)  <  7 m1/q)  <  ^1  +  ^ 


<  27m1/9  for  some  v  G  Eg) 
ive 

/<? 


+  P  fs^(Al)  <  Km l!q  and  ||Ai;||9  <  2y with  v  G  Eq\U^j  . 

When $v \in S_q \setminus U_q$ in the event that $s_1^{(q)}(A) \le K m^{1/q}$ and $\|Av\|_q < 2\gamma m^{1/q}$, there exists a $u \in U_q$ within $q$-distance $\delta$ of $v$ such that

$\|Au\|_q^q \le \|A(v - u)\|_q^q + \|Av\|_q^q$
$\le \left(s_1^{(q)}(A)\right)^q \|v - u\|_q^q + \|Av\|_q^q$
$\le K^q m \delta + (2\gamma)^q m$
$\le \lambda m$

if $\delta \le \frac{\lambda - (2\gamma)^q}{K^q}$. We can use (5.5) again to conclude

(5.7) $P\left(s_1^{(q)}(A) \le K m^{1/q} \text{ and } \|Av\|_q < 2\gamma m^{1/q} \text{ for some } v \in S_q \setminus U_q\right) \le \left(1 + \frac{2}{\delta}\right)^{n/q} c^m.$

If we choose $\nu$ and $c$ small enough in Lemma 5.1 with $n = \nu m$ such that

$c_2 := \left(1 + \frac{2}{\delta}\right)^{\nu/q} c < 1,$

we have thus completed the proof by choosing $b > 0$ such that $e^{-c_1 m} + c_2^m \le e^{-bm}$. □


5.2. The square random matrix case. Now let us consider the square random matrices with pre-Gaussian entries.


Theorem 5.2. Given any $0 < q \le 1$, let $\xi$ be the pre-Gaussian random variable with variance 1 and $A$ be an $n \times n$ matrix with i.i.d. copies of $\xi$ in its entries. Then for any $\varepsilon > 0$ and $0 < q \le 1$, there exist some $K > 0$ and $c > 0$ dependent on $\varepsilon$ and the pre-Gaussian random variable $\xi$ such that

(5.8) $P\left(s_n^{(q)}(A) < \varepsilon n^{-1/q}\right) \le C\varepsilon + C\alpha^n + P\left(\|A\| > K n^{1/2}\right),$

where $\alpha \in (0, 1)$ and $C > 0$ depend only on the pre-Gaussian variable and $K$.


To prove the above theorem, we generalize the ideas in Rudelson and Vershynin’2008, [15] to the setting of the $\ell_q$-quasinorm. We first decompose $S_q^{n-1}$ into the set of compressible vectors and the set of incompressible vectors. The concepts of compressible and incompressible vectors in $S_2^{n-1}$ were introduced in [15]. See also Tao and Vu’2009, [27]. We shall use a generalized version of these concepts. Recall that $\|x\|_0$ denotes the number of nonzero entries of the vector $x \in \mathbb{R}^n$.

Definition 5.1 (Compressible and incompressible vectors in $S_q^{n-1}$). Fix $\rho, \lambda \in (0, 1)$. Let $\mathrm{Comp}_q(\lambda, \rho)$ be the set of vectors $v \in S_q^{n-1}$ such that there is a vector $v'$ with $\|v'\|_0 \le \lambda n$ satisfying $\|v - v'\|_q \le \rho$. The set of incompressible vectors is defined as

(5.9) $\mathrm{Incomp}_q(\lambda, \rho) := S_q^{n-1} \setminus \mathrm{Comp}_q(\lambda, \rho).$
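
Definition 5.1 can be tested directly for a given vector, because the closest vector with at most $s$ nonzero entries in the $\ell_q$-quasinorm is obtained by keeping the $s$ entries of largest magnitude. The sketch below assumes NumPy, takes $s = \lfloor \lambda n \rfloor$, and expects the input to lie on the unit $\ell_q$ sphere; the function names are hypothetical.

```python
import numpy as np


def distance_to_sparse(v, q, s):
    """l_q distance from v to the set of vectors with at most s nonzero entries.

    The minimizer keeps the s largest-magnitude entries of v, so the distance
    is the l_q quasinorm of the remaining (smallest) entries.
    """
    magnitudes = np.sort(np.abs(v))              # ascending order
    dropped = magnitudes[: max(v.size - s, 0)]   # entries that are zeroed out
    return float(np.sum(dropped ** q) ** (1.0 / q))


def is_compressible(v, q, lam, rho):
    """Check membership in Comp_q(lam, rho) for v on the unit l_q sphere."""
    s = int(np.floor(lam * v.size))              # sparsity budget ||v'||_0 <= lam * n
    return distance_to_sparse(v, q, s) <= rho


sparse_like = np.array([1.0, 0.0, 0.0, 0.0])     # already 1-sparse
spread_out = np.full(4, 0.25)                    # mass spread over all entries
print(is_compressible(sparse_like, q=1.0, lam=0.25, rho=0.5))
print(is_compressible(spread_out, q=1.0, lam=0.25, rho=0.5))
```

A vector concentrated on few coordinates is compressible, while a vector whose mass is spread evenly over all coordinates is incompressible for small $\rho$.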


Now using the decomposition in Definition 5.1, we have

(5.10) $P\left(s_n^{(q)}(A) < \varepsilon n^{-1/q}\right) \le P\left(\inf_{v \in \mathrm{Comp}_q(\lambda, \rho)} \|Av\|_q < \varepsilon n^{-1/q}\right) + P\left(\inf_{v \in \mathrm{Incomp}_q(\lambda, \rho)} \|Av\|_q < \varepsilon n^{-1/q}\right).$

In the following we are going to consider each of the two terms on the right hand side of the above inequality. For the first term on compressible vectors, we can apply Lemma 5.3 since

(5.11) $P\left(\inf_{v \in \mathrm{Comp}_q(\lambda, \rho)} \|Av\|_q < \varepsilon n^{-1/q}\right) \le P\left(\inf_{v \in \mathrm{Comp}_q(\lambda, \rho)} \|Av\|_q < \nu n^{1/q}\right)$

to conclude that it actually decays exponentially for $n > 1$, where $\nu = \lambda^{1/q}$ as in Lemma 5.3.

However, for incompressible vectors, we first consider $\mathrm{dist}_q(X_j, H_j)$, which denotes the distance between column $X_j$ of an $n \times n$ random matrix $A$ and the span of the other columns $H_j := \mathrm{span}(X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n)$ in the $\ell_q$-quasinorm. We generalize a result on the probability estimate of the distance in the $\ell_2$-norm in [15] to the $\ell_q$-quasinorm setting. This allows us to transform the probabilistic estimate on $\|Av\|_q$ for $v \in \mathrm{Incomp}_q(\lambda, \rho)$ to the probabilistic estimate on the average of the distances $\mathrm{dist}_q(X_j, H_j)$, $j = 1, 2, \ldots, n$.

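Lemma 5.4. Let $A$ be an $n \times n$ random matrix with columns $X_1, \ldots, X_n$, and let

$H_j := \mathrm{span}(X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n).$

Then for any $\rho, \lambda \in (0, 1)$ and $\varepsilon > 0$, one has

(5.12) $P\left(\inf_{v \in \mathrm{Incomp}_q(\lambda, \rho)} \|Av\|_q < \varepsilon\rho n^{-1/q}\right) \le \frac{1}{\lambda n} \sum_{j=1}^{n} P\left(\mathrm{dist}_q(X_j, H_j) < \varepsilon\right),$

in which $\mathrm{dist}_q$ is the distance defined by the $\ell_q$-quasinorm.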

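Proof. For every $v \in \mathrm{Incomp}_q(\lambda, \rho)$, by Definition 5.1, there are at least $\lambda n$ components $v_j$ of $v$ satisfying $|v_j| \ge \rho n^{-1/q}$, because otherwise $v$ would be within $\ell_q$-distance $\rho$ of the sparse vector given by the restriction of $v$ to the components $v_j$ satisfying $|v_j| \ge \rho n^{-1/q}$, with sparsity less than $\lambda n$, and thus $v$ would be compressible. Thus if we let $\mathcal{I}_1(v) := \{j : |v_j| \ge \rho n^{-1/q}\}$, then $|\mathcal{I}_1(v)| \ge \lambda n$.

Next, let $\mathcal{I}_2(A) := \{j : \mathrm{dist}_q(X_j, H_j) < \varepsilon\}$ and let $\mathcal{E}$ be the event that the cardinality of $\mathcal{I}_2(A)$ satisfies $|\mathcal{I}_2(A)| \ge \lambda n$. Applying Markov's inequality, we have

$P(\mathcal{E}) = P\left(|\mathcal{I}_2(A)| \ge \lambda n\right)$
$\le \frac{1}{\lambda n} E|\mathcal{I}_2(A)|$
$= \frac{1}{\lambda n} E\left|\{j : \mathrm{dist}_q(X_j, H_j) < \varepsilon\}\right|$
$= \frac{1}{\lambda n} \sum_{j=1}^{n} P\left(\mathrm{dist}_q(X_j, H_j) < \varepsilon\right).$

Since $\mathcal{E}^c$ is the event that

$\left|\{j : \mathrm{dist}_q(X_j, H_j) \ge \varepsilon\}\right| > (1 - \lambda) n$

for the random matrix $A$, if $\mathcal{E}^c$ occurs, then for every $v \in \mathrm{Incomp}_q(\lambda, \rho)$,

$|\mathcal{I}_1(v)| + |\mathcal{I}_2(A)^c| > \lambda n + (1 - \lambda) n = n.$

Hence there is some $j_0 \in \mathcal{I}_1(v) \cap \mathcal{I}_2(A)^c$. So we have

$\|Av\|_q \ge \mathrm{dist}_q(Av, H_{j_0}) = \mathrm{dist}_q(v_{j_0} X_{j_0}, H_{j_0}) = |v_{j_0}| \, \mathrm{dist}_q(X_{j_0}, H_{j_0}) \ge \varepsilon\rho n^{-1/q}.$

Therefore, if the event $\|Av\|_q < \varepsilon\rho n^{-1/q}$ occurs for some $v \in \mathrm{Incomp}_q(\lambda, \rho)$, then $\mathcal{E}$ also occurs. Thus

$P\left(\inf_{v \in \mathrm{Incomp}_q(\lambda, \rho)} \|Av\|_q < \varepsilon\rho n^{-1/q}\right) \le P(\mathcal{E}) \le \frac{1}{\lambda n} \sum_{j=1}^{n} P\left(\mathrm{dist}_q(X_j, H_j) < \varepsilon\right).$

This completes the proof. □

Note that $\mathrm{dist}_q(X_j, H_j) \ge \mathrm{dist}(X_j, H_j)$ because $\|\cdot\|_q \ge \|\cdot\|_2$. Thus we can take advantage of the estimate on $P(\mathrm{dist}(X_j, H_j) < \varepsilon)$ given in [15] to obtain the estimate on $P(\mathrm{dist}_q(X_j, H_j) < \varepsilon)$.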

Theorem 5.3 (Distance bound (cf. [15])). Let $A$ be a random matrix whose entries are independent variables with variance at least 1 and fourth moment bounded by $B$. Let $K \ge 1$. Then for every $\varepsilon > 0$,

(5.13) $P\left(\mathrm{dist}(X_j, H_j) < \varepsilon \text{ and } \|A\| \le K n^{1/2}\right) \le C(\varepsilon + \alpha^n),$

where $\alpha \in (0, 1)$ and $C > 0$ depend only on $B$ and $K$.

The above theorem implies that

(5.14) $P\left(\mathrm{dist}_q(X_j, H_j) < \varepsilon\right) \le P\left(\mathrm{dist}(X_j, H_j) < \varepsilon\right) \le C(\varepsilon + \alpha^n) + P\left(\|A\| > K n^{1/2}\right).$

Combining this with (5.10) and applying Lemma 5.4, we now reach the desired inequality in Theorem 5.2.

Furthermore, since $A$ is pre-Gaussian, using a standard concentration bound we know that for every $\varepsilon > 0$ there exists some $K > 0$ such that $P\left(\|A\| > K n^{1/2}\right) < \varepsilon$. Thus, we have proved Theorem 1.6.


6. The upper tail probability of the smallest q-singular value


In this section, we continue to study the estimate of the upper tail probability of the smallest $q$-singular value of an $n \times n$ pre-Gaussian matrix. Mainly we are going to prove Theorem 1.7. To do so, we need some preparation.

Let $X_j$ be the $j$-th column vector of $A$ and $\pi_j$ be the projection onto the subspace $H_j := \mathrm{span}(X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n)$. We first have

Lemma 6.1. For every $\alpha > 0$, one has

(6.1) $P\left(\|X_j - \pi_j(X_j)\|_q \ge \alpha n^{\frac{1}{q} - \frac{1}{2}}\right) \le c_1 e^{-c_2 \alpha^2} + c_3 n^{-c_4}$

for each $j = 1, 2, \ldots, n$, where $c_1, c_2, c_3, c_4 > 0$ are constants independent of $j$, $n$, and $q$.


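Proof. Without loss of generality, assume $j = 1$. Write $(a_1, a_2, \ldots, a_n) := X_1 - \pi_1(X_1)$. Applying the Berry-Esseen theorem (see for instance [21]), we know that

(6.2) $P\left(\|X_j - \pi_j(X_j)\|_2 \ge \alpha\right) = P\left(\left|\sum_{i=1}^{n} a_i \xi_i\right| \ge \alpha\right) = P(|g| \ge \alpha) + O(n^{-c})$

for some $c > 0$, where $g$ is a standard normal random variable.
By the Hölder inequality,

$\|X_j - \pi_j(X_j)\|_q \le n^{\frac{1}{q} - 1} \|X_j - \pi_j(X_j)\|_1 \le n^{\frac{1}{q} - \frac{1}{2}} \|X_j - \pi_j(X_j)\|_2.$

It follows that

$P\left(\|X_j - \pi_j(X_j)\|_q \ge n^{\frac{1}{q} - \frac{1}{2}} \alpha\right) \le P\left(n^{\frac{1}{q} - \frac{1}{2}} \|X_j - \pi_j(X_j)\|_2 \ge n^{\frac{1}{q} - \frac{1}{2}} \alpha\right)$
$= P\left(\|X_j - \pi_j(X_j)\|_2 \ge \alpha\right).$

Therefore it follows from (6.2) that

$P\left(\|X_j - \pi_j(X_j)\|_q \ge n^{\frac{1}{q} - \frac{1}{2}} \alpha\right) \le P(|g| \ge \alpha) + O(n^{-c})$
$= \sqrt{\frac{2}{\pi}} \int_\alpha^\infty e^{-t^2/2} \, dt + O(n^{-c})$
$\le c_1 e^{-c_2 \alpha^2} + c_3 n^{-c_4}$

for some positive constants $c_1, c_2, c_3, c_4$.

□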

We are now ready to prove Theorem 1.7.


Proof of Theorem 1.7. From Section 5.2 and by Lemma 2.4, we know that the $n \times n$ pre-Gaussian matrix $A$ is invertible with very high probability. Thus, we have
(6.3) 


p^49)4)  <  (T-n~1/q^j  >p(l! 

Thus  it  suffices  to  show  that 


v\\q  <  a,  ||A  1n||(?  >  j  ■  re1/9  for  some  v  € 


(6.4) 


v\\q  <  a, \\A  ^4  >  j  ■  n1/q'j  >  1  -  e 


for  some  vector  v  0. 

Using  the  result  established  in  Rudelson  and  Vershynin’2008,  [14] ,  we  can  easily 
get  the  desired  probability  of  the  event  that  ||  1v||  <  |  -re«  occurs.  Indeed,  since 

|4_1t/||  >  ||A_1'i/||  ,  we  know  that 


(6.5) 


A  1n  < 


—  t 


-V?)  < 


(14  1n||2  <  |  •  re1/9) 

=  P(’|4-1n||2  <  f  •  (re2/9)172) 

<  2p  (4e,  t,  re2/9)  , 

where  p  (e,  t,  re)  :=  C5  4  +  e~C6t  +  e~C7n)  for  some  positive  constants  C5,  C6,  C7. 

Next  let  us  choose  v  =  X\  —  tt\  (Xi).  Lemma  6.1  together  with  the  estimate  in 
(6.5)  yield  (6.4).  Indeed,  letting  u  =  t  =  y/ln  M  with  M  >  1  and  s  =  4j,  we  have 


(6.6) 


P  4^4)  >  MlnM  •  re  1/2^  < 


C 


+  c" 


1LC 


for  some  (7  >  0,  0  <  c  <  1,  and  a  >  0.  Then  choosing  K  :=  MlnM,  we  have 

,C#>(A)  >  Ku-rA  <  _  CdnKf 

\  n  \  J  j  -  K<x  _ 

if  M  >  e,  which  requires  K  >  e.  These  complete  the  proof. 

1 Introduction

An understanding of physiological time series such as heart-beat intervals is important in many areas, such as heart-attack prediction, cardiovascular health, and sport and exercise. The study of time series can reveal underlying mechanisms of the physiological system, which usually contains both deterministic and stochastic components. The analysis is therefore very complicated because of the nonlinear and non-stationary characteristics of physiological time series data. Over the past years, time series analysis methods have been applied to quantify physiological data for identification and classification (see [7, 12]). Applications of physiological time series analysis commonly focus on measuring different aspects of time series data such as complexity, regularity, predictability, dimensionality, randomness, and self-similarity. The tools used in these techniques include, but are not restricted to, the mean, standard deviation, Fourier transform, wavelets, entropy, fractal dimension, and pattern detection (see [8, 13]).

Recently a new mathematical tool, empirical mode decomposition (EMD), was proposed by Norden Huang et al. (see [5, 6]). It decomposes a time series into a finite sum of intrinsic mode functions (IMF) that generally admit well-behaved Hilbert transforms. This decomposition is based on the local characteristic time scale of the data, which makes EMD applicable to the analysis of nonlinear and non-stationary signals. EMD and the Hilbert transform together, called the Hilbert-Huang transform (HHT), usually allow the construction of meaningful time-frequency representations of signals using the instantaneous frequency of the data. EMD and HHT have been applied with great success in many application areas such as biological and medical sciences, geology, astronomy, engineering, and others (see [5, 1, 3, 6, 11, 10]). Another interesting set of examples is the work of L. Yang, who has successfully applied EMD-based techniques to texture analysis and Chinese handwriting recognition (see [16, 17, 15, 18]).

The main purpose of this paper is to develop a new approach for the analysis of physiological time series. Our approach is motivated by two intuitions and coupled with modern machine learning techniques. The first intuition comes from a belief that a physiological system should contain a deterministic part that reflects the basic mechanism for the system to survive and a stochastic part that represents the variability of resilience. Mathematically they can be represented by the low frequency and high frequency components of a physiological signal. This motivates the application of methods that decompose signals into various components according to frequency in the quantitative analysis of physiological time series. Examples include the Fourier transform, wavelets, and EMD. In our method we will use an iterative convolution filter, which is an alternative to EMD. The second intuition comes from a statistical perspective on irregularity. Many studies have shown that normal physiological systems exhibit irregularity due to the existence of stochastic components, while a decrease of irregularity usually implies abnormality. From a statistical perspective, the irregularity of a data set is represented by the “outliers”. This motivates us to study the statistics of outliers in physiological time series. However, we must be careful in doing so. Practical physiological time series usually contain noise, which may also appear as outliers. We have to guarantee that the “outliers” we examine are not pure noise. This is possible because true outliers do not have informative structures and can be detected. The second intuition is the motivation for our feature construction in Section 2.2.

These two intuitions enable us to decompose the physiological time series and construct features for our quantitative analysis. Combining them with well-established feature selection techniques in machine learning, we can remove the redundancy of the features and find relevant statistics for the classification of physiological time series. SVM-RFE (Support Vector Machine Recursive Feature Elimination) is suggested in this paper for linear classification problems. The details of our approach will be described in Section 2.

We apply our approach to the study of congestive heart failure. The purpose is two-fold: the first is to build a good classifier to enable good diagnosis; the second is to find what kind of irregularity is related to heart health. The results and discussions are summarized in Section 3.

The novelty of our method lies mainly in the following two points. Firstly, although we decompose the time series into components of different frequencies, we do not compare them in the frequency domain. Secondly, we show that the outliers in a physiological time series are usually not true outliers but are informative instead.


2  Method 

2.1  Signal  decomposition 

Let $L$ be a low pass filter. Denote by $T$ the weak limit of the operator $(I - L)^n$ as $n \to \infty$, i.e., for a discrete signal $X$ and time $t$,

$T(X)(t) = \lim_{n \to \infty} (I - L)^n (X)(t).$

Using this operator iteratively, a signal $X$ can be decomposed as follows: Let $F_1 = T(X)$ and for $k \ge 2$,

$F_k = T\left(X - \sum_{i=1}^{k-1} F_i\right).$

After $m$ steps we get $F_1, \ldots, F_m$, which we call mode functions, and the residual

$R = X - \sum_{i=1}^{m} F_i.$

Then we have

$X = F_1 + F_2 + \ldots + F_m + R.$

In this decomposition, roughly speaking, the earlier mode functions are noise or high frequency components, the later mode functions are low frequency components, and $R$ is the trend.

This procedure follows the spirit of the traditional EMD introduced in [5]. In the traditional EMD, the low pass filter $L$ is chosen as the average of the upper envelope (the cubic spline connecting the local maxima) and the lower envelope (the cubic spline connecting the local minima). Although this method has been used successfully in many applications, it lacks a theoretical foundation and has its limitations.

In [9] a new approach is proposed. In this new approach the low pass filter is a moving average generated by a mask $a = (a_j)_{j=-N}^{N}$ that gives $L(X)$ as the convolution of $a$ and $X$, i.e.,

$L(X)(t) = \sum_{j=-N}^{N} a_j X(j + t).$

With this choice of $L$ we call the operator $T$ an iterative convolution filter. A rigorous mathematical foundation and convergence analysis are given in [9, 14]. Note that the mask $a$ is finitely supported on $[-N, N]$, and $N$ is called the window size. The flexibility to choose the window size is crucial in applications and forms a main advantage of this method.
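
The decomposition described above can be sketched in a few lines. The following is an illustrative sketch under stated assumptions, not the implementation used in this work: it assumes NumPy, a triangular mask (chosen here because its Fourier symbol lies in $[0, 1]$), reflection padding at the boundaries, a fixed iteration cap with a simple stopping rule in place of the limit $n \to \infty$, and one window size per mode function. All function names are hypothetical.

```python
import numpy as np


def triangular_mask(N):
    """Symmetric triangular mask a_{-N}, ..., a_N that sums to one (an assumed choice)."""
    w = np.arange(1, N + 2, dtype=float)          # 1, 2, ..., N+1
    mask = np.concatenate([w, w[-2::-1]])         # length 2N + 1
    return mask / mask.sum()


def moving_average(x, mask):
    """L(X)(t) = sum_j a_j X(j + t), with reflection padding at the boundaries."""
    N = (len(mask) - 1) // 2
    padded = np.pad(x, N, mode="reflect")
    return np.correlate(padded, mask, mode="valid")


def iterative_filter(x, mask, max_iter=200, tol=1e-8):
    """Approximate T(X) = lim (I - L)^n X by repeatedly subtracting the moving average."""
    f = np.asarray(x, dtype=float).copy()
    for _ in range(max_iter):
        low = moving_average(f, mask)
        f = f - low
        if np.linalg.norm(low) <= tol * np.linalg.norm(x):
            break                                  # the low-pass part has (nearly) vanished
    return f


def decompose(x, masks):
    """Return mode functions F_1, ..., F_m and the residual R, one mask per mode."""
    residual = np.asarray(x, dtype=float).copy()
    modes = []
    for mask in masks:
        F = iterative_filter(residual, mask)       # F_k = T(X - F_1 - ... - F_{k-1})
        modes.append(F)
        residual = residual - F
    return modes, residual


t = np.linspace(0.0, 1.0, 512)
signal = np.sin(2 * np.pi * 40 * t) + np.sin(2 * np.pi * 5 * t) + 2.0 * t
modes, trend = decompose(signal, [triangular_mask(3), triangular_mask(25)])
print(np.allclose(sum(modes) + trend, signal))
```

By construction the mode functions and the residual add back up to the original signal; increasing the window size from one mode to the next moves from high frequency components towards the trend.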

As with decompositions by many other methods, such as the Fourier transform and wavelets, the trend and low frequency components are usually assumed to characterize the profile of the signal, and the high frequency components characterize the details. Different applications need the features of different components.

2.2  Feature  extraction 

After  decomposing  the  signal  into  the  mode  functions  and  the  trend,  we 
need  to  extract  statistics  that  can  characterize  the  essential  features  of  these 
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components.  This  step  requires  a  priori  knowledge  of  the  problem  under 
consideration.  It  could  be  rather  weak.  But  without  any  priori  knowledge, 
it  is  difficult  to  get  proper  statistics.  Also,  this  step  is  strongly  problem 
dependent.  In  the  following  let  us  use  the  heart-beat  intervals  as  an  example 
to  illustrate  how  to  construct  the  features. 

For each mode function $F_i$ we first compute its mean $m_i$ and standard
deviation $\sigma_i$. According to previous studies [2], the healthy heart beats more
irregularly than the unhealthy heart. This motivates us to design statistics that
measure the irregularity. To this end, we consider the terms that are larger than
$m_i + \sigma_i$ and find their mean and standard deviation. We also find the mean and
standard deviation of the terms that are larger than $m_i + 2\sigma_i$. Symmetrically,
we compute the mean and standard deviation of the terms that are smaller
than $m_i - \sigma_i$ and of those smaller than $m_i - 2\sigma_i$. All these terms are in some
sense “outliers”, and it is natural to use the statistics of the outliers to
characterize the irregularity. Together with $m_i$ and $\sigma_i$ this gives ten statistics
for $F_i$.

Next we consider the local maxima and local minima of $F_i$. These two
series measure the local upper and lower amplitudes. For each series we
compute the same ten statistics as for $F_i$.

Therefore  for  each  component  we  get  30  statistics. 
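
The count of 30 can be made concrete with a short sketch. The text does not say whether the population or sample standard deviation is used, how empty outlier sets are handled, or how local extrema are detected; the choices below (population standard deviation, zeros for empty sets, strict interior extrema) and all function names are assumptions made for illustration only.

```python
import statistics


def mean_std(values):
    """Return (mean, population std); (0.0, 0.0) for an empty list (assumed convention)."""
    if not values:
        return 0.0, 0.0
    return statistics.fmean(values), statistics.pstdev(values)


def ten_statistics(series):
    """Mean and std of the series, then mean and std of its four outlier sets."""
    m, s = mean_std(series)
    stats = [m, s]
    for k in (1, 2):  # terms larger than m + s, then larger than m + 2s
        stats.extend(mean_std([v for v in series if v > m + k * s]))
    for k in (1, 2):  # terms smaller than m - s, then smaller than m - 2s
        stats.extend(mean_std([v for v in series if v < m - k * s]))
    return stats


def local_extrema(series):
    """Strict interior local maxima and minima (assumed definition)."""
    inner = range(1, len(series) - 1)
    maxima = [series[i] for i in inner if series[i - 1] < series[i] > series[i + 1]]
    minima = [series[i] for i in inner if series[i - 1] > series[i] < series[i + 1]]
    return maxima, minima


def thirty_statistics(series):
    """Ten statistics each for the series, its local maxima and its local minima."""
    maxima, minima = local_extrema(series)
    return ten_statistics(series) + ten_statistics(maxima) + ten_statistics(minima)
```

Calling `thirty_statistics` on one mode function returns a flat list of 30 numbers, ordered as series, local maxima, local minima.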

Unlike in [2], we use the whole 24-hour heart beat time series and assume
that we do not know the periods of different activities such as sleeping and
walking. We expect the statistics for different periods to differ, and not all of
them to reflect the difference between healthy and unhealthy people. This
motivates the idea of splitting the whole time series into subseries.
Suppose we have $K$ subseries for each patient. Then we get $K$ subcomponents
for each mode function, which will be denoted by $F_{ij}$, $j = 1, \ldots, K$. For
each subcomponent $F_{ij}$ we also get the 30 statistics as for $F_i$. For the same
statistic, we have $K$ values from the $K$ subcomponents. We compute the
mean of all values, the mean of the lower half, and the mean of the upper half.
This gives 90 summary statistics. So for each component we get 120 statistics.
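
A minimal sketch of this summary step follows. It assumes that "lower half" and "upper half" refer to the sorted $K$ values (dropping the middle value when $K$ is odd) and that the subseries are contiguous blocks of nearly equal length; neither detail is stated in the text, and the function names are hypothetical.

```python
import statistics


def split_equally(series, k):
    """Split a series into k contiguous blocks of (nearly) equal length."""
    n = len(series)
    bounds = [j * n // k for j in range(k + 1)]
    return [series[bounds[j]:bounds[j + 1]] for j in range(k)]


def summarize(values):
    """Mean of all values, of the lower half and of the upper half (needs >= 2 values)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return (statistics.fmean(ordered), statistics.fmean(lower), statistics.fmean(upper))


def subseries_summary(series, k, statistic):
    """Apply one statistic to each of the k subseries and summarize the k values."""
    values = [statistic(block) for block in split_equally(series, k)]
    return summarize(values)
```

Each of the 30 statistics yields three summary values in this way, which accounts for the 90 summary statistics; adding the 30 statistics of the whole component gives 120.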

For physiological signals, we believe the trend and low frequency components
are determined by the fundamental mechanism, while individual
differences should be reflected by the high frequency components. When
we do not have much knowledge about the disease to be diagnosed, we
may assume that features may also come from the trend. So the same 120
statistics are also computed for the trend component.

2.3  Feature  subset  selection 

After the above two steps we have obtained many features for the data. Usually
only a small part of them is related to the diagnosis and the physiological
mechanism of the disease. The task of the third step is to find the relevant
ones. This is done by eliminating the irrelevant ones step by step.

Firstly, if a statistic is almost constant, then it is useless in the diagnosis
and should be eliminated. For example, the means $m_i$ of the mode functions
are all approximately zero and should be eliminated.

Next we use the SVM-RFE method [4] to rank the features. In this
method, given a set of training samples, we first train a linear SVM to get
a classifier and then rank the features according to the weights. Because
the number of features is large and the training sample is small, the classifier
might not be very good. Also, high correlation between the features may cause
relevant features to receive small weights. For these reasons the ranking may
be inaccurate. To refine the ranking we eliminate the least important
feature and repeat the process to re-rank the remaining features. Running
this process iteratively, we finally get the refined ranking of the features.
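
The elimination loop can be sketched as follows. This is only an illustration of the recursive idea: the linear SVM here is trained by a plain deterministic subgradient method on the hinge loss, which stands in for whatever SVM solver was actually used, and the parameters `lam` and `epochs` as well as the function names are assumptions. Features are ranked by the squared weight, and the feature with the smallest squared weight is removed at each round.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Linear SVM (hinge loss + L2 penalty) by deterministic subgradient descent.

    X is a list of feature lists, y a list of labels in {-1, +1}.
    Returns the weight vector w and the bias b.
    """
    d = len(X[0])
    w, b, t = [0.0] * d, 0.0, 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]  # shrink (L2 penalty)
            if margin < 1.0:  # sample violates the margin: hinge subgradient step
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def svm_rfe(X, y, train=train_linear_svm):
    """Return feature indices ranked from most to least important."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train(X_sub, y)
        # the feature with the smallest squared weight is the least important
        worst = min(range(len(remaining)), key=lambda i: w[i] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]
```

Because the classifier is retrained after every removal, a feature that initially received a small weight only because of a correlated partner can move up once that partner has been eliminated.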

With this ranking of the features we can conclude which statistics are useful
for the diagnosis and characterize the essence of the underlying physiological
mechanism. Good classifiers can then be built to make accurate diagnoses.

3  Experiments  and  Results 

In this section we apply the new method described in Section 2 to the heart
beat interval time series and report our results and conclusions.

3.1  The  data  set 

The data set includes the heart beat interval time series of 72 healthy people
and 43 CHF patients. For each person the heart beat interval is measured
for 24 hours under various activities. In our experiment we assume that the
activity periods are not known. The average age of both groups is
55 years. The standard deviation of the age of the CHF patients is 11 years and
that of the healthy people is 16 years. If we divide the CHF patients into 4 degrees,
where degree I is slight CHF and degree IV is severe CHF, most
CHF patients are of degree III.

Figure 1: The mean and variance (in seconds) of the time series, ’o’ for
healthy people and for CHF patients.

3.2 A preliminary study

Before  using  our  new  method,  we  study  the  classification  ability  of  two  simple 
statistics:  mean  and  variance.  In  Figure  1  we  plot  the  mean  and  variance 
of  the  heart  beat  intervals  for  the  healthy  people  and  CHF  patients.  We  see 
that  the  healthy  people  and  the  CHF  patients  can  be  roughly  separated.  The 
average  heart  beat  interval  of  healthy  people  is  larger  and  so  is  the  variance. 
It  shows  the  heart  of  healthy  people  beats  slower  and  more  irregularly.  This 
observation  coincides  with  the  previous  study. 

At the same time, we notice that several cases that fall within the healthy
group turn out to be severe CHF patients. So we conjecture that the mean and
variance might not reflect the essence of the underlying mechanism, although
they have good separability.


3.3  Experiment:  feature  extraction 

For each time series, we use the iterative convolution filter to realize the
signal decomposition. In this step we need to specify the window size of the
mask. It turns out that it should be chosen between 50 and 100 for the
decomposition to be stable. In our experiment it is chosen to be 50.

We then calculate the statistics proposed in Subsection 2.2. Here we need
to specify the parameter $K$, the number of subseries. If a statistic really
captures the essence of the data set, it should be stable and independent of the
choice of $K$ as long as $K$ is chosen within a reasonable interval. Our experiments
show that $K = 50$ is a good choice. Most heart beat signals were recorded
for a little more than 24 hours. Thus when $K = 50$, each subseries covers
around 30 minutes of the record.

Previous studies have shown that a healthy heart beats irregularly. In
statistics, irregularity can be measured by statistics of “outliers” that are
not due to noise. This motivates us to consider the upper half mean and
the lower half mean of the fluctuations. At the same time, from the study in
Section 3.2 we find that a healthy heart beats slower than an unhealthy heart
on average. These two observations lead us to conjecture that the larger
heart beat intervals (i.e. slower heart beats) in the time series characterize
the difference between healthy people and CHF patients. To confirm
this, we perform a correlation analysis.

For the first two IMFs of the 50 components of each time series, we
calculate and sort the mean and standard deviation of those terms larger
than the mean plus the standard deviation and of those terms larger than the
mean plus two times the standard deviation. For each statistic we compute its
correlation with the CHF disease. The result is plotted in red in Figure 2. We
compute the same indices for those items smaller than the mean minus one
and two times the standard deviation. The result is plotted in blue in Figure 2.
From the comparison we see that, on average, the correlations of the
statistics associated with the larger fluctuations are larger, and the upper half
means of these statistics are stable. This observation motivates us to disregard
the smaller fluctuations and their statistics.
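
The text does not specify which correlation measure is used. As one plausible reading, the sketch below computes the Pearson correlation between a statistic and a 0/1 disease label (the point-biserial correlation); the label coding and the function name are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


# Hypothetical values of one statistic for four subjects, with label 1 = CHF.
statistic_values = [1.0, 2.0, 3.0, 4.0]
chf_labels = [0, 0, 1, 1]
r = pearson(statistic_values, chf_labels)
```

With a binary label, a value of `r` near +1 or -1 means the statistic alone nearly separates the two groups, while a value near 0 means it carries little linear information about the diagnosis.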

3.4  Feature  ranking  and  subset  selection 

To rank the features, we randomly split the data set into two subsets, a
training set and a test set. In the training set we have 50
healthy subjects and 30 CHF subjects, and in the test set there are 22 healthy
and 13 CHF subjects. We use the training set to build the SVM classifier
and use the test set to check the accuracy. Using the SVM-RFE method
described in Subsection 2.3 we rank the features. To guarantee the stability
of the ranking we repeat this procedure 1000 times and choose the statistics
that appear most frequently in the model.
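
The repetition scheme can be sketched as below. The ranking step is passed in as a function, so any feature ranker can be plugged in; how many top features are counted per repeat is not stated in the text, so the `top` parameter, the label coding (1 = CHF) and all names are assumptions.

```python
import random
from collections import Counter


def stratified_split(labels, n_train_chf, n_train_healthy, rng):
    """Randomly pick fixed numbers of CHF (label 1) and healthy subjects for training."""
    chf = [i for i, label in enumerate(labels) if label == 1]
    healthy = [i for i, label in enumerate(labels) if label != 1]
    rng.shuffle(chf)
    rng.shuffle(healthy)
    train = chf[:n_train_chf] + healthy[:n_train_healthy]
    test = chf[n_train_chf:] + healthy[n_train_healthy:]
    return train, test


def selection_frequency(labels, rank_features, top, repeats,
                        n_train_chf=30, n_train_healthy=50, seed=0):
    """Count how often each feature lands among the top ranked ones over many splits."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(repeats):
        train, _test = stratified_split(labels, n_train_chf, n_train_healthy, rng)
        counts.update(rank_features(train)[:top])
    return counts
```

Here `rank_features(train)` is expected to return feature indices ordered from most to least important for the given training subjects; features with the highest counts are the ones retained.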

In all 1000 repeats, the classification error on the test data set is summarized
in the following table:

number of errors     0     1     2     3     4     5
number of repeats    823   116   42    14    4     1
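
The table can be condensed into an average error rate with a few lines of arithmetic. The numbers are taken from the table and from the test set size of 22 + 13 subjects; only the variable names are new.

```python
# number of test errors -> number of repeats with that many errors
errors_to_repeats = {0: 823, 1: 116, 2: 42, 3: 14, 4: 4, 5: 1}
test_size = 22 + 13  # healthy + CHF subjects in the test set

total_repeats = sum(errors_to_repeats.values())
total_errors = sum(e * r for e, r in errors_to_repeats.items())
mean_errors = total_errors / total_repeats      # average errors per repeat
mean_error_rate = mean_errors / test_size       # average fraction misclassified
worst_error_rate = max(errors_to_repeats) / test_size
```

The repeats sum to 1000 and contain 263 errors in total, so on average about 0.26 of the 35 test subjects (roughly 0.75%) are misclassified, and the worst repeat misclassifies 5 of 35.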

Figure 2: The correlations of various statistics with the CHF disease. The
first column is for the first IMF and the second column is for the second
IMF. The first row is for the mean of those items larger than the mean
plus the standard deviation (red line) and those items smaller than the mean
minus the standard deviation (blue line). The second row is for the standard
deviation of the two types of items. The third row is for the mean of those items
larger than the mean plus 2 times the standard deviation (red line) and those
items smaller than the mean minus 2 times the standard deviation (blue line).
The fourth row is for the standard deviation of the two types of items.

We list the top 10 statistics selected by this procedure:

1.  IMF  1:  For  the  subseries  consisting  of  local  maxima,  find  all  terms  which 
are  greater  than  the  mean  plus  two  times  standard  deviation,  then  compute 
the  standard  deviation. 

2.  IMF  1:  For  the  subseries  consisting  of  local  maxima,  find  all  terms  which 
are  less  than  the  mean  minus  two  times  standard  deviation,  then  compute 
the  standard  deviation. 

3.  IMF  1:  Equally  divide  the  series  into  K  subseries,  for  each  subseries  find 
all  terms  which  are  less  than  the  mean  minus  two  times  standard  deviation, 
compute  the  standard  deviation,  then  take  the  mean  of  these  K  standard 
deviations. 

4.  IMF  1:  Equally  divide  the  series  into  K  subseries,  find  local  maxima  of 
each  subseries,  find  all  terms  of  local  maxima  which  are  greater  than  the 
mean  plus  two  times  standard  deviation,  compute  the  standard  deviation, 
then  take  the  mean  of  these  K  standard  deviations. 

5.  IMF  1:  Equally  divide  the  series  into  K  subseries,  find  local  minima  of 
each  subseries,  find  all  terms  of  local  minima  which  are  greater  than  the 
mean  plus  two  times  standard  deviation,  compute  the  standard  deviation, 
then take the mean of these K standard deviations.

6.  IMF  2:  Find  all  terms  which  are  greater  than  the  mean  plus  two  times 
standard  deviation,  then  compute  the  standard  deviation. 

7.  IMF  2:  Equally  divide  the  series  into  K  subseries,  for  each  subseries  find  all 
terms  which  are  greater  than  the  mean  plus  two  times  standard  deviation, 
compute  the  standard  deviation,  then  take  the  mean  of  these  K  standard 
deviations.

8. IMF 2: Equally divide the series into K subseries, find local maxima of
each  subseries,  find  all  terms  of  local  maxima  which  are  greater  than  the 
mean  plus  two  times  standard  deviation,  compute  the  standard  deviation, 
then  take  the  mean  of  these  K  standard  deviations. 

9.  IMF  2:  Equally  divide  the  series  into  K  subseries,  find  local  minima  of 
each  subseries,  find  all  terms  of  local  minima  which  are  less  than  the  mean 
minus  two  times  standard  deviation,  compute  the  standard  deviation,  then 
take  the  mean  of  these  K  standard  deviations. 

10.  Trend:  Equally  divide  the  series  into  K  subseries,  find  local  maxima 
of  each  subseries,  find  all  terms  of  local  minima  which  are  greater  than  the 
mean  plus  standard  deviation,  compute  the  standard  deviation,  then  take 
the  mean  of  these  K  standard  deviations. 

These 10 statistics, which appear most frequently in the model, all measure
the irregularity of the local amplitude. Take Statistic 1 and Statistic 7 as
examples. They are obtained as follows. To get Statistic 1, for
the first IMF $F_1$, find the local maxima $u$ and compute the mean $m$ and
the standard deviation $\sigma$ of $u$. Then we choose the terms of $u$ greater than
$m + 2\sigma$ and find their standard deviation. To get Statistic 7, for the subseries
$F_{2j}$, $j = 1, \ldots, K$, of the second IMF, compute the mean $m_{2j}$ and the standard
deviation $\sigma_{2j}$ of $F_{2j}$. Then we choose the terms of $F_{2j}$ greater than
$m_{2j} + 2\sigma_{2j}$ and find their standard deviation. Then we compute the mean of
the $K$ such standard deviations. In Figure 3 we show the distribution of the
healthy people and CHF patients using these two statistics. From this figure
it is easy to see that healthy people and CHF patients are well separated.

Figure 3: CHF * vs Healthy o. The x-axis is Statistic 1 and the y-axis is
Statistic 7.

Observing these two statistics, we find that both of them measure the
ability of the heart beat to become much slower than usual. Our result
suggests that a strong capacity for extremely slow heart beats might be
the irregularity that characterizes healthy hearts.

3.4.1  Reliability  of  the  top  features 

We have found that the most relevant features are statistics for the “outliers”,
i.e., those items larger than the mean plus two times the standard deviation, or items
less than the mean minus two times the standard deviation, for the IMFs. A natural
question arises: “Is this accidental?” This is equivalent to asking whether the
outliers taken into account are noise or informative.

To answer this question we analyze these outliers further. First,
we notice that the upward and downward fluctuations are not balanced for either
healthy people or CHF patients. For healthy people, the percentage of items
larger than the mean plus two times the standard deviation is 2.84%, while the
percentage of items smaller than the mean minus two times the standard deviation
is only 2.35%. For CHF patients the percentages are 2.49% and 2.17%, respectively.
This observation is the first evidence that the outliers are not due to noise,
because otherwise they would be evenly balanced. Moreover, recall that for a
normal distribution the percentage of one-sided outliers beyond two standard
deviations is 2.28%. The outlier percentages for CHF patients are closer to this
value, as expected for noise, while healthy people have more outliers, which are
therefore probably not due to noise alone and hence are informative.
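
The 2.28% reference value is the upper tail of the standard normal distribution beyond two standard deviations, which can be checked with the complementary error function. The percentages are those quoted above; the variable names are illustrative.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


normal_two_sigma = upper_tail(2.0)  # approximately 0.0228

# Reported upper-side outlier fractions (above mean + 2 std).
upper_outliers = {"healthy": 0.0284, "CHF": 0.0249}
excess = {group: frac - normal_two_sigma for group, frac in upper_outliers.items()}
```

The excess over the Gaussian tail is about 0.56 percentage points for healthy people and about 0.21 percentage points for CHF patients, which is the comparison made in the paragraph above.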

To further confirm our conclusion, we perform the following test: we calculate
the statistics for the items larger than the mean plus $v$ times the standard
deviation, with $v$ varying from 0 to 2, and investigate their correlation
with the CHF disease. Here we consider the third quartile of the 50 standard
deviations of these items in the 50 components. The correlation is plotted in Figure
4. From this analysis, we see that the correlation increases with $v$. Such a trend
also appears in other statistics. This clear trend implies that the relevance
of these statistics to the CHF disease is not accidental. Instead, we
should consider the outliers informative, and their properties characterize the
essential difference between healthy people and CHF patients.

Figure 4: Correlations of the statistics described in Section 3.4.1 with $v$
varying from 0 to 2.

4 Conclusions and discussions

In this paper we developed a new approach for the analysis of physiological
time series. The motivation is that physiological time series usually contain
both deterministic and stochastic parts, which can be represented by the low
and high frequency components of the time series. Our new method uses an
iterative filter to decompose the time series into high and low frequency
components and studies their statistics. SVM-RFE is then used to select highly
relevant features.

Our method is applied to analyze the heart beat interval time series for
CHF disease. The top features are found to measure the ability of the heart to
beat extremely slowly. Healthy hearts show a strong ability of this kind, which
we conjecture is due to strong resilience to the environment and human activities.

This paper is concerned with the question of reconstructing a vector in a finite-
dimensional real Hilbert space when only the magnitudes of the coefficients of the
vector under a redundant linear map are known. We analyze various Lipschitz
bounds of the nonlinear analysis map and we establish theoretical performance
bounds of any reconstruction algorithm. The discussion of robustness is with respect
to random noise and with respect to deterministic perturbations. We show that
robust and uniformly stable reconstruction is not achievable with the minimum
redundancy for phaseless reconstruction. Robust reconstruction schemes require
redundancy beyond the critical threshold.

1. Introduction

This paper is concerned with the question of reconstructing a vector $x$ in a finite-dimensional real Hilbert
space $H$ of dimension $n$ when only the magnitudes of the coefficients of the vector under a redundant linear
map are known.

Specifically our problem is to reconstruct $x \in H$ up to an overall change of sign from the magnitudes
$\{|\langle x, f_k\rangle|,\ 1 \le k \le m\}$ where $\mathcal{F} = \{f_1, \ldots, f_m\}$ is a frame (complete system) for $H$.

A  previous  paper  [6]  described  the  importance  of  the  phaseless  reconstruction  problem.  One  particular 
case  is  when  the  coefficients  are  obtained  from  an  Undecimated  Wavelet  Transform.  This  case  is  relevant 
for  instance  in  some  audio  and  image  signal  processing  applications,  as  well  as  in  neural  computations  as 
performed  by  the  auditory  cortex  [13]. 

While  [6]  presents  some  necessary  and  sufficient  conditions  for  reconstruction,  the  general  problem  of 
finding  fast/efficient  algorithms  is  still  open.  In  [3]  we  describe  one  solution  in  the  case  of  STFT  coefficients. 

For vectors in real Hilbert spaces, the reconstruction problem is easily shown to be equivalent to a
combinatorial problem. In [7] this problem is further proved to be equivalent to a (nonconvex) optimization
problem.

A different approach (which we called the algebraic approach) was proposed in [2]. While it applies to
both real and complex cases, noiseless and noisy cases, the approach requires solving a linear system of
size exponential in the space dimension. This algebraic approach generalizes the approach in [8] where
reconstruction is performed with complexity $O(n^2)$ (plus computation of the principal eigenvector for a
matrix of size $n$). However this method requires $m = O(n^2)$ frame vectors.

Recently the authors of [10] developed a convex optimization algorithm (a SemiDefinite Program called
PhaseLift) and proved its ability to perform exact reconstruction in the absence of noise, as well as its
stability under noise conditions. In a separate paper [11], the authors further developed a similar algorithm
in the case of windowed DFT transforms. Inspired by the PhaseLift and MaxCut algorithms, but operating
in the coefficients space, the authors of [16] proposed a SemiDefinite Program called PhaseCut. They show
the algorithm yields the exact solution in the absence of noise under similar conditions as PhaseLift.

The  paper  [4]  presents  an  iterative  regularized  least-square  algorithm  for  inverting  the  nonlinear  map 
and  compares  its  performance  to  a  Cramer-Rao  lower  bound  for  this  problem  in  the  real  case.  The  paper 
also  presents  some  new  injectivity  results  which  are  incorporated  into  this  paper. 

A different approach is proposed in [1]. There the authors use a 4-term polarization identity together
with a family of spectral expander graphs to design a frame of bounded redundancy ($\frac{m}{n} \le 236$) that yields
an exact reconstruction algorithm in the absence of noise.

The authors of [14] study several robustness bounds to the phase recovery problem in the real case.
However their approach is different from ours in several respects. First they consider a probabilistic setup
of this problem, where data $x$ and frame vectors $f_j$ are random vectors with probabilities from a class of
subgaussian distributions. Additionally, their focus is on classes of $k$-sparse signals. In our paper we analyze
stability bounds of reconstruction for a fixed frame using deterministic analytic tools. After that we present
asymptotic behavior of these bounds for random frames.

Finally, the authors of [9] analyze the phaseless reconstruction problem for both the real and complex
case. In the real case the authors obtain the exact upper Lipschitz constant for the nonlinear map $\alpha_{\mathcal{F}}$,
namely $\sqrt{B}$ where $B$ is the upper frame bound. For the lower Lipschitz constant, they give an estimate
between two computable singular eigenvalues. Our results have overlaps with their results. However, in
our paper we improve the lower Lipschitz constant by giving its exact value. There are some significant
differences between this paper and [9]. In addition to studying the Lipschitz property of the map $\alpha_{\mathcal{F}}$ we
focus also on two related but different settings. First we study the robustness of the reconstruction given
a fixed error allowance in measurements. Second we also consider the Lipschitz property of the map $\alpha_{\mathcal{F}}^2$.
The authors of [9] point out that the map $\alpha_{\mathcal{F}}^2$ is not bi-Lipschitz. However in our paper we show $\alpha_{\mathcal{F}}^2$
becomes bi-Lipschitz for a different metric on the domain. With this metric (the one induced by the nuclear
norm on the set of symmetric operators) the nonlinear map $\alpha_{\mathcal{F}}^2$ is bi-Lipschitz with constants indicated in
Theorem 4.5. Furthermore the same conclusion holds true in the complex case, although this will be studied
elsewhere.

The  organization  of  the  paper  is  as  follows.  Section  2  formally  defines  the  problem  and  reviews  existing 
inversion  results  in  the  real  case.  Section  3  establishes  information  theoretic  performance  bounds,  namely 
the  Cramer-Rao  lower  bound.  Section  4  contains  robustness  measures  of  any  reconstruction  algorithm. 
Section  5  presents  a  stochastic  analysis  of  these  bounds.  Section  6  presents  a  numerical  example  and  is 
followed  by  references. 

2.  Background 

Let us denote by $H = \mathbb{R}^n$ the $n$-dimensional real Hilbert space $\mathbb{R}^n$ with scalar product $\langle \cdot, \cdot \rangle$. Let
$\mathcal{F} = \{f_1, \ldots, f_m\}$ be a spanning set of $m$ vectors in $H$. In finite dimension (as is the case here) such a set
forms a frame. In the infinite dimensional case, the concept of frame involves a stronger property than
completeness (see for instance [12]). We review additional terminology and properties which remain still



R. Balan, Y. Wang / Appl. Comput. Harmon. Anal. 38 (2015) 469-488


true in the infinite dimensional setting. The set $\mathcal{F}$ is a frame if and only if there are two positive constants
$0 < A \le B < \infty$ (called frame bounds) so that

\[ A\|x\|^2 \le \sum_{k=1}^{m} |\langle x, f_k\rangle|^2 \le B\|x\|^2. \tag{2.1} \]

When we can choose $A = B$ the frame is said to be tight. For $A = B = 1$ the frame is called Parseval. The frame
matrix corresponding to $\mathcal{F}$ is defined as $F = [f_1, f_2, \ldots, f_m]$ with the vectors $f_j \in \mathcal{F}$ as its columns. We
shall frequently identify $\mathcal{F}$ with its corresponding frame matrix $F$. The largest $A$ and smallest $B$ in (2.1)
are called the lower frame bound and upper frame bound of $\mathcal{F}$, and they are given by

\[ A = \lambda_{\min}(FF^*) = \sigma_n^2(F), \qquad B = \lambda_{\max}(FF^*) = \sigma_1^2(F) \tag{2.2} \]

where $\lambda_{\max}$, $\lambda_{\min}$ denote the largest and smallest eigenvalues respectively, while $\sigma_1$, $\sigma_n$ denote the first and
$n$-th singular values respectively. A set of vectors $\mathcal{F}$ of the $n$-dimensional Hilbert space $H$ is said to be full
spark if any subset of $n$ vectors is linearly independent.
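
As a small worked illustration of these definitions (an example chosen here, not taken from the paper), consider three unit vectors in $\mathbb{R}^2$ spaced at angles of $2\pi/3$:

```latex
\[
f_k = \begin{bmatrix} \cos(2\pi k/3) \\ \sin(2\pi k/3) \end{bmatrix},
\quad k = 0, 1, 2,
\qquad
FF^* = \sum_{k=0}^{2} f_k f_k^*
     = \begin{bmatrix}
         1 + \tfrac14 + \tfrac14 & 0 \\
         0 & 0 + \tfrac34 + \tfrac34
       \end{bmatrix}
     = \tfrac32\, I .
\]
```

Both eigenvalues of $FF^*$ equal $3/2$, so the frame bounds coincide, $A = B = 3/2$, and the frame is tight. Any two of the three vectors are linearly independent, so this set is also full spark.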

For a vector $x \in H$, the collection of coefficients $\{\langle x, f_j\rangle : 1 \le j \le m\}$ represents the analysis map
of vector $x$ given by the frame $\mathcal{F}$, and from which $x$ can be completely reconstructed. In the phaseless
reconstruction problem, we ask the following question: Can $x$ be reconstructed from $\{|\langle x, f_j\rangle| : 1 \le j \le m\}$?
Consider the following equivalence relation $\sim$ on $H$: $x \sim y$ if and only if $y = cx$ for some unimodular
constant $c$, $|c| = 1$. Since we focus on the real vector space $H = \mathbb{R}^n$, we have $x \sim y$ if and only if $x = \pm y$.
Clearly the phaseless reconstruction problem cannot distinguish $x$ and $y$ if $x \sim y$, so we will be looking
at reconstruction on $\hat{H} := H/\!\sim\ = \mathbb{R}^n/\!\sim$ whose elements are given by equivalence classes $\hat{x} = \{x, -x\}$ for
$x \in \mathbb{R}^n$. The analogous analysis map for phaseless reconstruction is the following nonlinear map

\[ \alpha_{\mathcal{F}}(x) = \left[\, |\langle x, f_1\rangle|,\ |\langle x, f_2\rangle|,\ \ldots,\ |\langle x, f_m\rangle| \,\right]^T. \tag{2.3} \]

Note that $\alpha_{\mathcal{F}}$ can also be viewed as a map from $\mathbb{R}^n$ to $\mathbb{R}^m_+$. Throughout the paper we will not make an
explicit distinction unless such a distinction is necessary.

Thus the phaseless reconstruction problem aims to reconstruct $\hat{x} \in \hat{H}$ from the map $\alpha_{\mathcal{F}}(x)$. We say a
frame $\mathcal{F}$ is phase retrievable if one can reconstruct $\hat{x} \in \hat{H}$ for all $x$, or in other words, $\alpha_{\mathcal{F}}$ is injective on $\hat{H}$.
The main objective of this paper is to analyze robustness and stability of the inversion map, and to give
performance bounds of any reconstruction algorithm.
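
The sign ambiguity built into the magnitude map (2.3) is easy to see numerically. The frame and test vectors below are illustrative choices, and the function name `alpha` is an assumption of this sketch.

```python
def alpha(frame, x):
    """Magnitudes |<x, f_k>| of the frame coefficients of x."""
    return [abs(sum(fi * xi for fi, xi in zip(f, x))) for f in frame]


# Three vectors in R^2 used as frame elements (illustrative choice).
frame = [[1, 0], [0, 1], [1, 1]]

x = [1, -2]
minus_x = [-1, 2]
y = [1, 2]

same_class = alpha(frame, x) == alpha(frame, minus_x)   # x and -x are indistinguishable
different = alpha(frame, x) != alpha(frame, y)          # y is not in the class {x, -x}
```

For this frame the vectors `x` and `minus_x` give identical magnitudes, while `y` differs in the third coefficient, so the map separates the classes $\{x, -x\}$ and $\{y, -y\}$ even though it cannot separate $x$ from $-x$.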

Before proceeding further we first review existing results on injectivity of the nonlinear map $\alpha_{\mathcal{F}}$. In
general a subset $Z$ of a topological space is said to be generic if its open interior is dense. However in the
following statements, the term generic refers to the Zariski topology: a set $Z \subset \mathbb{K}^{n \times m} = \mathbb{K}^n \times \cdots \times \mathbb{K}^n$ is
said to be generic if $Z$ is dense in $\mathbb{K}^{n \times m}$ and its complement is a finite union of zero sets of polynomials in $nm$
variables with coefficients in the field $\mathbb{K}$ (here $\mathbb{K} = \mathbb{R}$).

Theorem 2.1. Let $\mathcal{F}$ be a frame in $H = \mathbb{R}^n$ with $m$ elements. Then the following hold true:

1. The frame $\mathcal{F}$ is phase retrievable in $\hat{H}$ if and only if for any disjoint partition of the frame set $\mathcal{F} =
\mathcal{F}_1 \cup \mathcal{F}_2$, either $\mathcal{F}_1$ spans $\mathbb{R}^n$ or $\mathcal{F}_2$ spans $\mathbb{R}^n$.

2. If $\mathcal{F}$ is phase retrievable in $\hat{H}$ then $m \ge 2n - 1$. Furthermore, for a generic $\mathcal{F}$ with $m \ge 2n - 1$ the map
$\alpha_{\mathcal{F}}$ is phase retrievable in $\hat{H}$.

3. Let $m = 2n - 1$. Then $\mathcal{F}$ is phase retrievable in $\hat{H}$ if and only if $\mathcal{F}$ is full spark.

4. Let

\[ a_0 := \min_{\|x\| = \|y\| = 1} \sum_{j=1}^{m} |\langle x, f_j\rangle|^2 |\langle y, f_j\rangle|^2 \ge 0, \tag{2.4} \]

so that

\[ \sum_{k=1}^{m} |\langle x, f_k\rangle|^2 |\langle y, f_k\rangle|^2 \ge a_0 \|x\|^2 \|y\|^2. \tag{2.5} \]

Then $\mathcal{F}$ is phase retrievable on $\hat{H}$ if and only if $a_0 > 0$.

5. For any $x \in \mathbb{R}^n$ define the matrix $R(x)$ by

\[ R(x) := \sum_{j=1}^{m} |\langle x, f_j\rangle|^2 f_j f_j^*. \tag{2.6} \]

Let $\lambda_{\min}(R(x))$ denote the smallest eigenvalue of $R(x)$, and let $a_0 = \min_{\|x\| = 1} \lambda_{\min}(R(x))$. Equivalently
let $a_0$ be the largest constant so that $R(x) \ge a_0 \|x\|^2 I$ for all $x \in H$, where $I$ is the identity matrix.
Then $\mathcal{F}$ is phase retrievable on $\hat{H}$ if and only if $a_0 > 0$.

Additionally the constant $a_0$ introduced here is the same as the constant $a_0$ given by (2.4).

The results (1)-(3) are in [6], and (4)-(5) are in [4].
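
The partition criterion in item 1 can be checked by brute force for small frames. The sketch below enumerates every split of the frame into two parts and tests whether at least one part spans $\mathbb{R}^n$, using a simple Gaussian-elimination rank with a numerical tolerance; the tolerance, the function names and the example frames are assumptions, and the cost grows as $2^m$, so this is meant only for illustration.

```python
from itertools import combinations


def rank(vectors, tol=1e-9):
    """Rank of a list of vectors via Gaussian elimination with partial pivoting."""
    rows = [list(map(float, v)) for v in vectors]
    ncols = len(rows[0]) if rows else 0
    r = 0
    for c in range(ncols):
        pivot = max(range(r, len(rows)), key=lambda i: abs(rows[i][c]), default=None)
        if pivot is None or abs(rows[pivot][c]) < tol:
            continue  # no usable pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def is_phase_retrievable(frame):
    """Check that for every partition of the frame, one part spans R^n."""
    n, m = len(frame[0]), len(frame)
    for size in range(m + 1):
        for subset in combinations(range(m), size):
            part1 = [frame[i] for i in subset]
            part2 = [frame[i] for i in range(m) if i not in subset]
            if rank(part1) < n and rank(part2) < n:
                return False  # neither part spans: criterion fails
    return True
```

For example, the frame `[[1, 0], [0, 1], [1, 1]]` in $\mathbb{R}^2$ has $m = 3 = 2n - 1$ vectors and is full spark, so the check succeeds, whereas the two-vector basis `[[1, 0], [0, 1]]` fails because it can be split into two single vectors.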

3.  Information  theoretic  performance  bounds 

In  this  section  we  derive  expressions  for  the  Fisher  Information  Matrix  and  obtain  performance  bounds 
for  reconstruction  algorithms  in  the  noisy  case. 

Consider the following noisy measurement process:

\[ y_k = |\langle x, f_k\rangle|^2 + \nu_k, \qquad \nu_k \sim \mathcal{N}(0, \sigma^2), \quad 1 \le k \le m \tag{3.1} \]

where the noise model is AWGN (additive white Gaussian noise): each random variable $\nu_k$ is independent
and normally distributed with zero mean and $\sigma^2$ variance.
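
The measurement model (3.1) is straightforward to simulate. The following sketch draws one noisy measurement vector; the frame, the signal and the function name are illustrative assumptions, and the random generator is passed in explicitly so that runs can be reproduced.

```python
import random


def noisy_measurements(frame, x, sigma, rng):
    """Return y_k = |<x, f_k>|^2 + nu_k with independent nu_k ~ N(0, sigma^2)."""
    return [
        sum(fi * xi for fi, xi in zip(f, x)) ** 2 + rng.gauss(0.0, sigma)
        for f in frame
    ]


frame = [[1, 0], [0, 1], [1, 1]]   # illustrative frame in R^2
x = [1, -2]                        # illustrative signal

noiseless = noisy_measurements(frame, x, 0.0, random.Random(1))
noisy = noisy_measurements(frame, x, 0.1, random.Random(1))
```

With `sigma = 0.0` the function returns the exact squared magnitudes, which is the noiseless case considered first in the text below.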

Consider the noiseless case first (that is $\nu_k = 0$). Obviously one cannot obtain the exact vector $x \in H$
due to the global sign ambiguity. Instead the best outcome is to identify (that is, to estimate) the class
$\hat{x} = \{x, -x\}$ from $\alpha_{\mathcal{F}}(x)$. As such, we fix a disjoint partition of the punctured Hilbert space $H$, $\mathbb{R}^n \setminus \{0\} =
\Omega_1 \cup \Omega_2$, such that $\Omega_2 = -\Omega_1$. We make the choice that the vector $x$ belongs to $\Omega_1$. Hence any estimator
of $x$ is a map $\omega : \mathbb{R}^m \to \Omega_1 \cup \{0\}$. Denote by $\mathring{\Omega}_1$ its interior as a subset of $\mathbb{R}^n$. Such a decomposition is,
for example

\[ \Omega_1 = \bigcup_{k=1}^{n} \{x \in \mathbb{R}^n : x_k > 0,\ x_j = 0 \text{ for } j < k\}. \]

Note its interior is given by $\mathring{\Omega}_1 = \{x \in \mathbb{R}^n,\ x_1 > 0\}$.

Under these assumptions we compute the Fisher information matrix (see [15]). This is given by

$$\mathbb{I}(x) = \mathbb{E}\left[(\nabla_x \log L(x))\,(\nabla_x \log L(x))^T\right], \tag{3.2}$$

where the likelihood function $L(x)$ is given by

$$L(x) = p(y \mid x) = \frac{1}{(2\pi)^{m/2}\sigma^m} \exp\left(-\frac{1}{2\sigma^2}\sum_{k=1}^{m} \bigl|y_k - |\langle x, f_k\rangle|^2\bigr|^2\right). \tag{3.3}$$

After some algebra (see [4]) we obtain

$$\mathbb{I}(x) = \frac{4}{\sigma^2} R(x), \qquad R(x) = \sum_{j=1}^{m} |\langle x, f_j\rangle|^2 f_j f_j^T. \tag{3.4}$$

Note the matrix $R(x)$ is exactly the same as the matrix introduced in (2.6). Thus we obtain the following results:

Theorem 3.1. The frame $\mathcal{F}$ is phase retrievable if and only if the Fisher information matrix $\mathbb{I}(x)$ is invertible for any $x \ne 0$.

When $\mathcal{F}$ is phase retrievable let $a_0$ be the positive constant introduced in (2.4). Then

$$\mathbb{I}(x) \ge \frac{4 a_0}{\sigma^2} \|x\|^2 I, \tag{3.5}$$

where $I$ is the $n \times n$ identity operator.

This allows us to state the following performance bound (see [15] for details on the Cramer-Rao lower bound).

Theorem 3.2. Assume $x \in \mathring{\Omega}_1$. Let $\omega : \mathbb{R}^m \to \Omega_1$ be any unbiased estimator for $x$. Then its covariance matrix is bounded below by the Cramer-Rao lower bound:

$$\mathrm{Cov}[\omega(y)] \ge (\mathbb{I}(x))^{-1} = \frac{\sigma^2}{4}(R(x))^{-1}. \tag{3.6}$$

Furthermore, any efficient estimator (that is, any unbiased estimator $\omega$ that achieves the Cramer-Rao lower bound (3.6)) has covariance matrix bounded from above by

$$\mathrm{Cov}[\omega(y)] \le \frac{\sigma^2}{4 a_0 \|x\|^2}\, I \tag{3.7}$$

and mean-square error bounded above by

$$\mathrm{MSE}(\omega) = \mathbb{E}\left[\|\omega(y) - x\|^2\right] \le \frac{n\sigma^2}{4 a_0 \|x\|^2}. \tag{3.8}$$
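To make the bounds (3.6) and (3.8) concrete, the sketch below evaluates the trace of the Cramer-Rao matrix $\frac{\sigma^2}{4}R(x)^{-1}$ and the right-hand side of (3.8) for a frame in $\mathbb{R}^2$. The frame, the test vector, the noise level and the value $a_0 = 0.5$ (which can be verified by hand for this frame) are assumptions chosen only for illustration.

```python
FRAME = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]  # example frame in R^2

def r_matrix(x, frame=FRAME):
    """Entries (r11, r12, r22) of R(x) from (2.6)/(3.4)."""
    r11 = r12 = r22 = 0.0
    for f in frame:
        c = (x[0] * f[0] + x[1] * f[1]) ** 2
        r11 += c * f[0] * f[0]
        r12 += c * f[0] * f[1]
        r22 += c * f[1] * f[1]
    return r11, r12, r22

def crlb_trace(x, sigma, frame=FRAME):
    """Trace of I(x)^{-1} = (sigma^2 / 4) R(x)^{-1}: the Cramer-Rao floor on the MSE."""
    r11, r12, r22 = r_matrix(x, frame)
    det = r11 * r22 - r12 * r12
    trace_inv = (r11 + r22) / det  # trace of the inverse of a 2x2 matrix
    return sigma ** 2 / 4.0 * trace_inv

def mse_upper_bound(x, sigma, a0, n=2):
    """Right-hand side of (3.8): n sigma^2 / (4 a_0 ||x||^2)."""
    return n * sigma ** 2 / (4.0 * a0 * (x[0] ** 2 + x[1] ** 2))

x, sigma, a0 = (1.0, 0.0), 0.1, 0.5  # a0 = 0.5 is assumed for this frame
floor = crlb_trace(x, sigma)
ceiling = mse_upper_bound(x, sigma, a0)
print(f"CRLB trace = {floor:.6f}, bound (3.8) = {ceiling:.6f}")
```

Here $R(x) = \mathrm{diag}(3, 2)$ for $x = (1, 0)$, so the Cramer-Rao trace is $\frac{\sigma^2}{4}\cdot\frac{5}{6}$, which sits below the worst-case bound $\sigma^2$ from (3.8), as expected.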

Remark 3.3. We point out the importance of the constant $a_0$ introduced in (2.4). On the one hand it represents a necessary and sufficient condition for phase retrievability as stated in Theorem 2.1. On the other hand the above results prove that $a_0$ also provides a bound for the Fisher information matrix and hence a bound for any efficient estimator of $x$. The larger this constant $a_0$, the smaller the variance of the efficient estimator. As we prove in the next section, the same constant $a_0$ determines the lower Lipschitz bound for the map $\alpha_{\mathcal F}^2$ (4.13) considered between $(\hat{H}, d_1)$ and the Euclidean space $(\mathbb{R}^m, \|\cdot\|)$; see Theorem 4.5.

Additionally, similar expressions involving the bound $a_0$ occur in the complex case as well. Both the stochastic bound above and the bi-Lipschitz result in Theorem 4.5 can be extended to the complex case; see [5].


4.  Robustness  measures  for  reconstruction 


In this section we analyze the robustness of deterministic phaseless reconstruction. Additionally we connect the constant $a_0$ introduced earlier in Theorem 2.1 and used in Theorem 3.1 to quantities directly computable from the frame $\mathcal{F}$.

Our approach is to analyze the stability in the worst case scenario, for which we consider the following measures. Denote $d(x,y) := \min(\|x-y\|, \|x+y\|)$. For any $x \in \mathbb{R}^n$ and $\varepsilon > 0$ define

$$Q_\varepsilon(x) := \max_{\{y \,:\, \|\alpha_{\mathcal F}(x) - \alpha_{\mathcal F}(y)\| \le \varepsilon\}} \frac{d(x,y)}{\varepsilon}. \tag{4.1}$$

The size of $Q_\varepsilon(x)$ measures the worst case stability of the reconstruction for the vector $x$, under the assumption that the total noise level is controlled by $\varepsilon$. We also study the global stability by analyzing the measures

$$q_\varepsilon := \max_{\|x\|=1} Q_\varepsilon(x), \qquad q_0 := \limsup_{\varepsilon \to 0} q_\varepsilon, \qquad q_\infty := \sup_{\varepsilon>0} q_\varepsilon. \tag{4.2}$$

Here $\|\cdot\|$ denotes the usual Euclidean norm. Note that $Q_\varepsilon(x)$ has the scaling property $Q_\varepsilon(x) = Q_{|c|\varepsilon}(cx)$ for any real $c \ne 0$. Thus it is natural to focus on unit vectors $x$.

We now introduce some quantities that play key roles in the estimation of these robustness measures. For the frame $\mathcal{F}$ let $F = [f_1, f_2, \dots, f_m]$ be its frame matrix. Denote by $\mathcal{F}[S] = \{f_k,\ k \in S\}$ the subset of $\mathcal{F}$ indexed by a subset $S \subset \{1,2,\dots,m\}$, and by $F_S$ the frame matrix corresponding to $\mathcal{F}[S]$ (which is the matrix with vectors in $\mathcal{F}[S]$ as its columns). Set

$$A[S] := \sigma_n^2(F_S) = \lambda_{\min}(F_S F_S^*), \tag{4.3}$$

where as usual $\sigma_n$ and $\lambda_{\min}$ denote the $n$-th singular value and the minimal eigenvalue, respectively. Note that $A[S]$ is in fact the lower frame bound of $\mathcal{F}[S]$.

Let $\mathcal{S}$ denote the collection of subsets $S$ of $\{1,2,\dots,m\}$ so that $\dim(\mathrm{span}(\mathcal{F}[S^c])) < n$, where $S^c = \{1,2,\dots,m\} \setminus S$ is the complement of $S$. In other words, $\mathrm{rank}(F_{S^c}) < n$. Denote by $\Delta$ and $\omega$ the following expressions:

$$\Delta = \min_{S} \sqrt{A[S] + A[S^c]}, \tag{4.4}$$

$$\omega = \min_{S \in \mathcal{S}} \sigma_n(F_S). \tag{4.5}$$

All of them depend of course on $\mathcal{F}$. However, since we fix $\mathcal{F}$ throughout the paper, we shall not explicitly reference $\mathcal{F}$ in the notation for simplicity as there will not be any confusion. Clearly $\Delta \le \omega$.

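Proposition 4.1. Let $\varepsilon > 0$. Then the stability measurement function $Q_\varepsilon(x)$ is given by

$$Q_\varepsilon(x) = \frac{1}{\varepsilon} \max_{(w_1,w_2) \in \Gamma} \min\{\|w_1\|, \|w_2\|\}, \tag{4.6}$$

where the constraint set $\Gamma$ is given by

$$\Gamma = \Bigl\{(w_1,w_2) \;\Big|\; \tfrac12 (w_1+w_2) = x,\ \sum_{j=1}^m \min\bigl(|\langle f_j,w_1\rangle|^2, |\langle f_j,w_2\rangle|^2\bigr) = \|F_S^* w_1\|^2 + \|F_{S^c}^* w_2\|^2 \le \varepsilon^2 \Bigr\}, \tag{4.8}$$

where $S := S(w_1,w_2) = \{j : |\langle f_j,w_1\rangle| \le |\langle f_j,w_2\rangle|\}$.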

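Proof. For any $x, y \in \mathbb{R}^n$ let $w_1 = x + y$ and $w_2 = x - y$. Then $x = \frac12(w_1 + w_2)$ and $y = \frac12(w_1 - w_2)$. It is easy to check that for $S = \{j : |\langle f_j,w_1\rangle| \le |\langle f_j,w_2\rangle|\}$ we have

$$|\langle f_j,x\rangle| - |\langle f_j,y\rangle| = \begin{cases} \pm\langle f_j,w_1\rangle, & j \in S,\\ \pm\langle f_j,w_2\rangle, & j \in S^c.\end{cases}$$

In other words,

$$\bigl|\, |\langle f_j,x\rangle| - |\langle f_j,y\rangle| \,\bigr| = \min\bigl(|\langle f_j,w_1\rangle|, |\langle f_j,w_2\rangle|\bigr). \tag{4.9}$$

Let $F$ be the frame matrix of $\mathcal{F}$. We thus have

$$\|\alpha_{\mathcal F}(x) - \alpha_{\mathcal F}(y)\|^2 = \sum_{j \in S} |\langle f_j,w_1\rangle|^2 + \sum_{j \in S^c} |\langle f_j,w_2\rangle|^2 = \|F_S^* w_1\|^2 + \|F_{S^c}^* w_2\|^2.$$

Note that $d(x,y) = \min(\|w_1\|, \|w_2\|)$. The proposition now follows. □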
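The identity behind the proof can be checked numerically. The sketch below draws random frames and random vectors (Gaussian entries, dimensions and seed are arbitrary assumptions) and compares $\|\alpha_{\mathcal F}(x) - \alpha_{\mathcal F}(y)\|^2$ with $\sum_j \min(|\langle f_j,w_1\rangle|^2, |\langle f_j,w_2\rangle|^2)$ for $w_1 = x+y$, $w_2 = x-y$.

```python
import random

def alpha(x, frame):
    """Magnitude measurements alpha_F(x) = (|<f_j, x>|)_j."""
    return [abs(sum(fi * xi for fi, xi in zip(f, x))) for f in frame]

def check_identity(n=3, m=7, trials=200, seed=0):
    """Largest gap between ||alpha(x)-alpha(y)||^2 and sum_j min(<f_j,w1>^2, <f_j,w2>^2)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        frame = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        w1 = [a + b for a, b in zip(x, y)]
        w2 = [a - b for a, b in zip(x, y)]
        lhs = sum((a - b) ** 2 for a, b in zip(alpha(x, frame), alpha(y, frame)))
        # alpha(.) is nonnegative, so min(p, q)**2 equals min(p**2, q**2).
        rhs = sum(min(p, q) ** 2 for p, q in zip(alpha(w1, frame), alpha(w2, frame)))
        worst = max(worst, abs(lhs - rhs))
    return worst

print(check_identity())
```

The printed gap should be at the level of floating-point rounding, which is consistent with (4.9) holding entry by entry.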

The  above  proposition  allows  us  to  establish  the  following  stability  result  for  the  worst  case  scenario. 

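Theorem 4.2. Assume that the frame $\mathcal{F}$ is phase retrievable. Let $A > 0$ be the lower frame bound for the frame $\mathcal{F}$ and let $\tau := \min\{\sigma_n(F_S) : S \subset \{1,\dots,m\},\ \mathrm{rank}(F_S) = n\}$.

(A) For any $\varepsilon > 0$ we have

$$\min\Bigl\{\frac{1}{\varepsilon}, \frac{1}{\omega}\Bigr\} \le q_\varepsilon \le \frac{1}{\Delta}. \tag{4.10}$$

(B) If $\varepsilon < \tau$ then $q_\varepsilon = \frac{1}{\omega}$. Consequently $q_0 = \frac{1}{\omega}$.

(C) For any nonzero $x \in \mathbb{R}^n$ and any $0 < \varepsilon < \Delta_x$ we have

$$Q_\varepsilon(x) = \frac{1}{\sqrt{A}},$$

where

$$\Delta_x := \frac{2\tau}{\max_j \|f_j\| + \tau}\, \min\{|\langle f_j,x\rangle| : \langle f_j,x\rangle \ne 0\}.$$

(D) The upper bound equals the reciprocal of $\Delta$:

$$q_\infty = \frac{1}{\Delta}.$$

Proof. To prove (A) we first establish the upper bound in (4.10). Let $x \in \mathbb{R}^n$. By Proposition 4.1 we have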

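$$Q_\varepsilon(x) = \frac{1}{\varepsilon} \max_{w_1,w_2} \min\{\|w_1\|, \|w_2\|\}$$

under the constraints $\frac12(w_1 + w_2) = x$ and

$$\|F_S^* w_1\|^2 + \|F_{S^c}^* w_2\|^2 \le \varepsilon^2$$

for some $S$. Now assume without loss of generality that $\|w_1\| \le \|w_2\|$. Then

$$\frac{\varepsilon^2}{\|w_1\|^2} \ge \frac{\|F_S^* w_1\|^2 + \|F_{S^c}^* w_2\|^2}{\|w_1\|^2} \ge \sigma_n^2(F_S) + \sigma_n^2(F_{S^c})\,\frac{\|w_2\|^2}{\|w_1\|^2} \ge \sigma_n^2(F_S) + \sigma_n^2(F_{S^c}) \ge \Delta^2.$$

It follows that

$$\frac{1}{\varepsilon} \min\{\|w_1\|, \|w_2\|\} \le \frac{1}{\Delta}.$$

Thus $Q_\varepsilon(x) \le \frac{1}{\Delta}$.

To establish the lower bound in (4.10) we construct for any $\varepsilon > 0$ an $x \in \mathbb{R}^n$ and vectors $w_1, w_2$ satisfying the imposed constraints. Let $S$ be a subset of $\{1,2,\dots,m\}$ such that $\mathrm{rank}(F_{S^c}) < n$ and $\sigma_n(F_S) = \omega$. Choose $v_1, v_2 \in \mathbb{R}^n$ with the property $\|v_1\| = \|v_2\| = 1$ and

$$\|F_S^* v_1\| = \omega, \qquad F_{S^c}^* v_2 = 0.$$

Set

$$t = \min\Bigl\{\frac{\varepsilon}{\omega}, 1\Bigr\} \quad\text{and}\quad w_1 = t v_1.$$

Hence $\|w_1\| = t \le 1$. Now we select an $s \in \mathbb{R}$ so that $\|w_1 + s v_2\| = 2$. This is always possible since $s \mapsto \|w_1 + s v_2\|$ is continuous and $\|w_1 + 0 v_2\| = t \le 1 < 2 \le \|w_1 + 3 v_2\|$. Set $w_2 = s v_2$ so $\|w_1 + w_2\| = 2$. We have

$$|s| = \|s v_2\| \ge \|w_1 + s v_2\| - \|w_1\| = 2 - t \ge 1.$$

Thus $\|w_2\| \ge \|w_1\|$. Now let

$$x = \tfrac12(w_1 + w_2) \quad\text{and}\quad y = \tfrac12(w_1 - w_2).$$

We have then

$$\|\alpha_{\mathcal F}(x) - \alpha_{\mathcal F}(y)\|^2 = \sum_{j=1}^m \min\bigl(|\langle f_j,w_1\rangle|^2, |\langle f_j,w_2\rangle|^2\bigr) \le \sum_{j \in S} |\langle f_j,w_1\rangle|^2 + \sum_{j \in S^c} |\langle f_j,w_2\rangle|^2 = t^2\omega^2 \le \varepsilon^2.$$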

DISTRIBUTION  A:  Distribution  approved  for  public  release. 


R. Balan, Y. Wang / Appl. Comput. Harmon. Anal. 38 (2015) 469-488


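Furthermore

$$d(x,y) = \min(\|w_1\|, \|w_2\|) = \|w_1\| = t.$$

Hence for this $x$ we have

$$Q_\varepsilon(x) \ge \frac{d(x,y)}{\varepsilon} = \frac{t}{\varepsilon} = \min\Bigl\{\frac{1}{\omega}, \frac{1}{\varepsilon}\Bigr\}.$$

It follows that $q_\varepsilon \ge \min\{\frac{1}{\varepsilon}, \frac{1}{\omega}\}$. Now by taking $\varepsilon > 0$ sufficiently small we have $q_\varepsilon \ge \frac{1}{\omega}$.

We now prove (B). Assume that $\varepsilon < \min\{\sigma_n(F_S) : \mathrm{rank}(F_S) = n\}$. Then clearly we have $\varepsilon < \omega$. Thus by (4.10) we have $q_\varepsilon \ge \frac{1}{\omega}$. Again for each $x \in \mathbb{R}^n$ with $\|x\| = 1$ we consider $w_1, w_2$ for the estimation of $Q_\varepsilon(x)$. The constraint $\|w_1 + w_2\| = 2$ implies either $\|w_1\| \ge 1$ or $\|w_2\| \ge 1$. Without loss of generality we assume that $\|w_1\| \ge 1$. For the constraint $\|F_S^* w_1\|^2 + \|F_{S^c}^* w_2\|^2 \le \varepsilon^2$ for some $S$, assume that $\mathrm{rank}(F_S) = n$; then we have

$$\|F_S^* w_1\| \ge \sigma_n(F_S)\|w_1\| \ge \min\{\sigma_n(F_S) : \mathrm{rank}(F_S) = n\} > \varepsilon.$$

This is a contradiction. So $\mathrm{rank}(F_S) < n$ and hence

$$\varepsilon^2 \ge \|F_S^* w_1\|^2 + \|F_{S^c}^* w_2\|^2 \ge \|F_{S^c}^* w_2\|^2 \ge \omega^2 \|w_2\|^2.$$

Thus $\|w_2\| \le \frac{\varepsilon}{\omega}$. Proposition 4.1 now yields $q_\varepsilon = \frac{1}{\omega}$, proving part (B).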

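Now we prove (C). We go back to the formulation in Proposition 4.1:

$$Q_\varepsilon(x) = \frac{1}{\varepsilon} \max \min\{\|w_1\|, \|w_2\|\}$$

under the constraints $\frac12(w_1 + w_2) = x$ and

$$\|F_S^* w_1\|^2 + \|F_{S^c}^* w_2\|^2 \le \varepsilon^2,$$

where $S := S(w_1,w_2) = \{j : |\langle f_j,w_1\rangle| \le |\langle f_j,w_2\rangle|\}$. Since $\alpha_{\mathcal F}$ is injective, either $\mathrm{rank}(F_S) = n$ or $\mathrm{rank}(F_{S^c}) = n$ by Theorem 2.1 (1). Without loss of generality we assume $\mathrm{rank}(F_S) = n$. Thus $\varepsilon \ge \|F_S^* w_1\| \ge \tau \|w_1\|$. So $\|w_1\| \le \varepsilon/\tau$. We show that for any $k \in S^c$ we must have $\langle f_k,x\rangle = 0$. Assume otherwise and write $w_2 = 2x - w_1$, $L_x := \min\{|\langle f_j,x\rangle| : \langle f_j,x\rangle \ne 0\}$. Then, using $\varepsilon < \Delta_x$ in the last step,

$$|\langle f_k,w_2\rangle| \ge 2|\langle f_k,x\rangle| - |\langle f_k,w_1\rangle| \ge 2L_x - \max_j(\|f_j\|)\,\|w_1\| \ge 2L_x - \max_j(\|f_j\|)\,\frac{\varepsilon}{\tau} > \varepsilon.$$

This contradicts the constraint $|\langle f_k,w_2\rangle| \le \|F_{S^c}^* w_2\| \le \varepsilon$. Thus for $k \in S^c$ we have $\langle f_k,x\rangle = 0$ and

$$|\langle f_k,w_2\rangle| = |\langle f_k, 2x - w_1\rangle| = |\langle f_k,w_1\rangle|.$$

It follows that

$$\|F_S^* w_1\|^2 + \|F_{S^c}^* w_2\|^2 = \|F^* w_1\|^2 \le \varepsilon^2.$$

Thus $\|w_1\| \le \varepsilon/\sqrt{A}$ and hence $Q_\varepsilon(x) \le \frac{1}{\sqrt{A}}$. Now we show the bound can be achieved. Let $w_1$ satisfy $\|F^* w_1\| = \sqrt{A}\,\|w_1\| = \varepsilon$. Such a $w_1$ always exists. Then clearly $w_1$ and $w_2 = 2x - w_1$ satisfy the required constraints, and it is easy to check that $\min(\|w_1\|, \|w_2\|) = \|w_1\| = \varepsilon/\sqrt{A}$.

Finally we prove (D). By the result at part (A), $q_\infty \le \frac{1}{\Delta}$. It is therefore sufficient to show that $Q_\varepsilon(x) \ge \frac{1}{\Delta}$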

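for some $x$ and $\varepsilon$. Let $S_0$ be the subset that achieves the minimum in (4.4). Let $u, v \in H$ be unit eigenvectors corresponding to the lowest eigenvalues of $F_{S_0}F_{S_0}^*$ and $F_{S_0^c}F_{S_0^c}^*$ respectively. Thus

$$\|F_{S_0}^* u\|^2 = A[S_0], \qquad \|F_{S_0^c}^* v\|^2 = A[S_0^c].$$

Let $x = (u + v)/2$ and $\varepsilon = \Delta$, and set $w_1 = u$, $w_2 = v$. Then by Proposition 4.1

$$Q_\varepsilon(x) \ge \frac{\min(\|w_1\|, \|w_2\|)}{\varepsilon} = \frac{1}{\Delta}$$

since

$$\sum_{j=1}^m \min\bigl(|\langle f_j,w_1\rangle|^2, |\langle f_j,w_2\rangle|^2\bigr) \le \|F_{S_0}^* w_1\|^2 + \|F_{S_0^c}^* w_2\|^2 = \varepsilon^2.$$

This concludes the proof. □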

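Remark. It may seem strange that $Q_\varepsilon(x) = \frac{1}{\sqrt{A}}$ for all $x \ne 0$ and sufficiently small $\varepsilon$ while $q_0 = \frac{1}{\omega}$, where $\omega$ is typically much smaller than $\sqrt{A}$. The reason is that for $Q_\varepsilon(x) = \frac{1}{\sqrt{A}}$ to hold, $\varepsilon$ depends on $x$. Thus we cannot exchange the order of $\limsup_{\varepsilon \to 0}$ and $\max_{\|x\|=1}$.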

Related to the study of stability of phaseless reconstruction is the study of the Lipschitz property of the map $\alpha_{\mathcal F}$ on $\hat{H} := \mathbb{R}^n/\sim$. We analyze the bi-Lipschitz bounds of both $\alpha_{\mathcal F}$ and $\alpha_{\mathcal F}^2$, which is simply the map $\alpha_{\mathcal F}$ with all entries squared, i.e.

$$\alpha_{\mathcal F}^2(x) := \bigl[\,|\langle f_1,x\rangle|^2, \dots, |\langle f_m,x\rangle|^2\,\bigr]^T. \tag{4.13}$$

We shall consider two distance functions on $\hat{H} = \mathbb{R}^n/\sim$: the standard distance $d(x,y) := \min(\|x - y\|, \|x + y\|)$ and the distance $d_1(x,y) := \|xx^* - yy^*\|_1$, where $\|X\|_1$ denotes the nuclear norm of $X$, which is the sum of all singular values of $X$. Specifically we are interested in examining the local and global behavior of the following ratios:

$$U(x,y) := \frac{\|\alpha_{\mathcal F}(x) - \alpha_{\mathcal F}(y)\|}{d(x,y)}, \qquad V(x,y) := \frac{\|\alpha_{\mathcal F}^2(x) - \alpha_{\mathcal F}^2(y)\|}{d_1(x,y)}. \tag{4.14}$$

While all norms in finite dimensional spaces are equivalent, we choose to consider $d_1$, the nuclear norm induced distance on $\hat{H}$, because the Lipschitz lower and upper bounds are very much related to the matrix $R(x)$ introduced in Theorem 2.1.

We first investigate the bounds for $U(x,y)$. For this the upper bound is relatively straightforward. Let $w_1 = x - y$ and $w_2 = x + y$. We have already shown in the proof of Theorem 4.2 using (4.9) that

$$\|\alpha_{\mathcal F}(x) - \alpha_{\mathcal F}(y)\|^2 = \sum_{j=1}^m \min\bigl(|\langle f_j,w_1\rangle|^2, |\langle f_j,w_2\rangle|^2\bigr) \le \min\Bigl\{\sum_{j=1}^m |\langle f_j,w_1\rangle|^2, \sum_{j=1}^m |\langle f_j,w_2\rangle|^2\Bigr\} \le B \min\{\|w_1\|^2, \|w_2\|^2\} = B\,d^2(x,y),$$

where $B$ is the upper frame bound of the frame $\mathcal{F}$. Thus $U(x,y)$ has an upper bound $U(x,y) \le \sqrt{B}$. Furthermore, the bound is sharp. To see this, pick a unit vector $x$ such that $\sum_{j=1}^m |\langle f_j,x\rangle|^2 = B$ and set $y = 2x$. Then $U(x,y) = \sqrt{B}$.

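To study the lower bound of $U(x,y)$ we now consider the following quantities:

$$\rho_\varepsilon(x) := \inf_{\{y \,:\, d(x,y) \le \varepsilon\}} U(x,y), \qquad \rho(x) := \liminf_{\{y \,:\, d(x,y) \to 0\}} U(x,y) = \liminf_{\varepsilon \to 0} \rho_\varepsilon(x),$$

$$\rho_0 := \inf_{x} \rho(x), \qquad \rho_\infty := \inf_{d(x,y)>0} U(x,y).$$

We apply the equality

$$U^2(x,y) = \frac{\sum_{j=1}^m \min\bigl(|\langle f_j,w_1\rangle|^2, |\langle f_j,w_2\rangle|^2\bigr)}{\min(\|w_1\|^2, \|w_2\|^2)},$$

where again $w_1 = x - y$ and $w_2 = x + y$. Now fix $x$ and let $d(x,y) < \varepsilon$. Without loss of generality we may assume $\|y - x\| < \varepsilon$. Thus $\|w_1\| < \varepsilon$ and $\|w_2 - 2x\| = \|w_1\| < \varepsilon$. Let $S = \{j : \langle f_j,x\rangle \ne 0\}$ and set

$$\varepsilon_0(x) := \frac{\min_{k \in S} |\langle f_k,x\rangle|}{\max_{k \in S} \|f_k\|}. \tag{4.15}$$

Note for any $w_1$ with $\|w_1\| < \varepsilon_0(x)$ and $k \in S$ we have

$$|\langle f_k,w_2\rangle| = |2\langle f_k,x\rangle - \langle f_k,w_1\rangle| \ge 2|\langle f_k,x\rangle| - |\langle f_k,w_1\rangle| \ge 2\varepsilon_0(x)\|f_k\| - \|w_1\|\,\|f_k\| > |\langle f_k,w_1\rangle|,$$

whereas for $k \in S^c$ we have

$$|\langle f_k,w_2\rangle| = |\langle f_k,w_1\rangle|.$$

Hence $\min(|\langle f_j,w_1\rangle|^2, |\langle f_j,w_2\rangle|^2) = |\langle f_j,w_1\rangle|^2$ for all $j$ whenever $\varepsilon \le \varepsilon_0(x)$. It follows that

$$U^2(x,y) = \frac{\sum_{j=1}^m |\langle f_j,w_1\rangle|^2}{\|w_1\|^2}.$$

Thus $U^2(x,y) \ge A$ where $A$ is the lower frame bound for the frame $\mathcal{F}$. Furthermore this lower bound is achieved whenever $w_1 = x - y$ is an eigenvector corresponding to the smallest eigenvalue of $FF^*$. This implies that

$$\rho_\varepsilon(x) = \sqrt{A}$$

whenever $\varepsilon \le \varepsilon_0(x)$. Consequently $\rho(x) = \sqrt{A}$. We have the following theorem: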

Theorem 4.3. Assume that the frame $\mathcal{F}$ is phase retrievable. Let $A$, $B$ be the lower and upper frame bounds for the frame $\mathcal{F}$, respectively, and for each $x \in \mathbb{R}^n$ let $\varepsilon_0(x)$ be given in (4.15). Then

(1) $U(x,y) \le \sqrt{B}$ for any $x, y \in \mathbb{R}^n$ with $d(x,y) > 0$.

(2) Assume that $\varepsilon \le \varepsilon_0(x)$. Then $\rho_\varepsilon(x) = \sqrt{A}$. Consequently $\rho(x) = \rho_0 = \sqrt{A}$.

(3) $\Delta = \rho_\infty \le \omega \le \rho_0 = \rho(x) = \sqrt{A}$.

(4) The map $\alpha_{\mathcal F}$ is bi-Lipschitz with (optimal) upper Lipschitz bound $\sqrt{B}$ and lower Lipschitz bound $\rho_\infty$:

$$\rho_\infty\, d(x,y) \le \|\alpha_{\mathcal F}(x) - \alpha_{\mathcal F}(y)\| \le \sqrt{B}\, d(x,y), \qquad \forall x, y \in \hat{H}.$$


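Proof. We have already proved (1) and (2) of the theorem in the discussion. It remains only to prove (3) since (4) is just a restatement of (1) and (3). Note that

$$\rho_\infty^2 = \inf_{d(x,y)>0} U^2(x,y) = \inf_{w_1,w_2 \ne 0} \frac{\sum_{j=1}^m \min\bigl(|\langle f_j,w_1\rangle|^2, |\langle f_j,w_2\rangle|^2\bigr)}{\min(\|w_1\|^2, \|w_2\|^2)}.$$

For any $w_1, w_2$, assume without loss of generality that $0 < \|w_1\| \le \|w_2\|$. Let $S = \{j : |\langle f_j,w_1\rangle| \le |\langle f_j,w_2\rangle|\}$. Set $v_1 = w_1/\|w_1\|$, $v_2 = w_2/\|w_2\|$ and $t = \|w_2\|/\|w_1\| \ge 1$. Then

$$\frac{\sum_{j=1}^m \min\bigl(|\langle f_j,w_1\rangle|^2, |\langle f_j,w_2\rangle|^2\bigr)}{\min(\|w_1\|^2, \|w_2\|^2)} = \sum_{j \in S} |\langle f_j,v_1\rangle|^2 + t^2 \sum_{j \in S^c} |\langle f_j,v_2\rangle|^2 \ge \sum_{j \in S} |\langle f_j,v_1\rangle|^2 + \sum_{j \in S^c} |\langle f_j,v_2\rangle|^2 \ge \Delta^2.$$

Hence $\rho_\infty \ge \Delta$.

Let $S$ and $u, v \in H$ be normalized (eigen)vectors that achieve the bound $\Delta$, that is:

$$\|u\| = \|v\| = 1, \qquad \sum_{k \in S} |\langle u,f_k\rangle|^2 + \sum_{k \in S^c} |\langle v,f_k\rangle|^2 = \Delta^2.$$

Set $x = u + v$ and $y = u - v$. Then, following [9],

$$\|\alpha_{\mathcal F}(x) - \alpha_{\mathcal F}(y)\|^2 = \sum_{k \in S} \bigl|\,|\langle u+v,f_k\rangle| - |\langle u-v,f_k\rangle|\,\bigr|^2 + \sum_{k \in S^c} \bigl|\,|\langle u+v,f_k\rangle| - |\langle u-v,f_k\rangle|\,\bigr|^2 \le 4\Bigl(\sum_{k \in S} |\langle u,f_k\rangle|^2 + \sum_{k \in S^c} |\langle v,f_k\rangle|^2\Bigr) = 4\Delta^2.$$

On the other hand

$$d(x,y) = \min(\|x - y\|, \|x + y\|) = 2.$$

Thus we obtain

$$\frac{\|\alpha_{\mathcal F}(x) - \alpha_{\mathcal F}(y)\|}{d(x,y)} \le \Delta.$$

The theorem is now proved. □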


Remark. The two quantities $\rho_\infty$ and $q_\infty$ satisfy $\rho_\infty = \frac{1}{q_\infty}$. However there are subtle differences between $Q_\varepsilon(x)$ and $\rho_\varepsilon(x)$ so that the simple relationship $\rho_\varepsilon(x) = 1/Q_\varepsilon(x)$ does not usually hold. One such difference is due to the significance of $\varepsilon$ for the two bounds. See the numerical example presented in the last section.

Remark. The upper Lipschitz bound $\sqrt{B}$ has been obtained independently in [9]. The lower Lipschitz bound we obtained here strengthens the estimates given in [9]. Specifically their estimate for $\rho_\infty$ reads $\sigma \le \rho_\infty \le \sqrt{2}\,\sigma$, where

$$\sigma = \min_{S} \max\bigl(\sigma_n(F_S), \sigma_n(F_{S^c})\bigr). \tag{4.16}$$

Clearly $\sigma \le \Delta \le \sqrt{2}\,\sigma$.

We conclude this section by turning our attention to the analysis of $V(x,y)$. A motivation for studying it is that in practical problems the noise is often added directly to $\alpha_{\mathcal F}^2$ as in (3.1) rather than to $\alpha_{\mathcal F}$. Such a noise model is used in many studies of phaseless reconstruction, e.g. in the PhaseLift algorithm [10], or in the IRLS algorithm in [4].

Let $\mathrm{Sym}_n(\mathbb{R})$ denote the set of $n \times n$ symmetric matrices over $\mathbb{R}$. It is a Hilbert space with the standard inner product given by $\langle X, Y\rangle := \mathrm{tr}(XY^T) = \mathrm{tr}(XY)$. The nonlinear map $\alpha_{\mathcal F}^2$ actually induces a linear map on $\mathrm{Sym}_n(\mathbb{R})$. Write $X = xx^T$ for any $x \in \mathbb{R}^n$. Then the entries of $\alpha_{\mathcal F}^2(x)$ are

$$(\alpha_{\mathcal F}^2(x))_j = |\langle f_j,x\rangle|^2 = x^T f_j f_j^T x = \mathrm{tr}(F_j X) = \langle F_j, X\rangle, \tag{4.17}$$

where $F_j := f_j f_j^T$. Now we denote by $\mathcal{A}$ the linear operator $\mathcal{A} : \mathrm{Sym}_n(\mathbb{R}) \to \mathbb{R}^m$ with entries

$$(\mathcal{A}(X))_j = \langle F_j, X\rangle = \mathrm{tr}(F_j X).$$

Let $S^{p,q}$ be the set of $n \times n$ real symmetric matrices that have at most $p$ positive and $q$ negative eigenvalues. Thus $S^{1,0}$ denotes the set of $n \times n$ real symmetric non-negative definite matrices of rank at most one. Note that spectral decomposition easily shows that $X \in S^{1,0}$ if and only if $X = xx^T$ for some $x \in \mathbb{R}^n$.

The following lemma will be useful in this analysis.

Lemma 4.4. The following are equivalent.

(A) $X \in S^{1,1}$.

(B) $X = xx^T - yy^T$ for some $x, y \in \mathbb{R}^n$.

(C) $X = \frac12(w_1 w_2^T + w_2 w_1^T)$ for some $w_1, w_2 \in \mathbb{R}^n$.

Furthermore, for $X = \frac12(w_1 w_2^T + w_2 w_1^T)$ its nuclear norm is $\|X\|_1 = \|w_1\|\,\|w_2\|$.

Proof. (A) $\Rightarrow$ (B) is a direct result of spectral decomposition, which yields $X = \beta_1 u_1 u_1^T - \beta_2 u_2 u_2^T$ for some $u_1, u_2 \in \mathbb{R}^n$ and $\beta_1, \beta_2 \ge 0$. Thus $X = xx^T - yy^T$ where $x := \sqrt{\beta_1}\,u_1$ and $y := \sqrt{\beta_2}\,u_2$.

(B) $\Rightarrow$ (C) is proved directly by setting $w_1 = x - y$ and $w_2 = x + y$.

We now prove (C) $\Rightarrow$ (A) by computing the eigenvalues of $X = \frac12(w_1 w_2^T + w_2 w_1^T)$. Obviously $\mathrm{rank}(X) \le 2$. Let $\lambda_1, \lambda_2$ be the two (possibly) nonzero eigenvalues of $X$. Then

$$\lambda_1 + \lambda_2 = \mathrm{tr}\{X\} = \langle w_1, w_2\rangle,$$

$$\lambda_1^2 + \lambda_2^2 = \mathrm{tr}\{X^2\} = \bigl(\|w_1\|^2\|w_2\|^2 + |\langle w_1,w_2\rangle|^2\bigr)/2.$$

Solving for eigenvalues we obtain

$$\lambda_1 = \tfrac12\bigl(\langle w_1,w_2\rangle + \|w_1\|\,\|w_2\|\bigr), \qquad \lambda_2 = \tfrac12\bigl(\langle w_1,w_2\rangle - \|w_1\|\,\|w_2\|\bigr).$$

Hence, by the Cauchy-Schwarz inequality, $\lambda_1 \ge 0 \ge \lambda_2$, which proves $X \in S^{1,1}$. Furthermore, it also shows that the nuclear norm of $X$ is $\|X\|_1 = |\lambda_1| + |\lambda_2| = \|w_1\|\,\|w_2\|$. □
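The closed forms in the proof of Lemma 4.4 are easy to test in dimension $n = 2$, where the symmetric matrix $X = \frac12(w_1 w_2^T + w_2 w_1^T)$ has exactly two eigenvalues. The sketch below compares them with $\frac12(\langle w_1,w_2\rangle \pm \|w_1\|\|w_2\|)$ and checks the nuclear norm formula; the random test vectors, the seed and the helper names are assumptions of this illustration.

```python
import math
import random

def sym_eigs(a, b, d):
    """Eigenvalues (largest, smallest) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean + radius, mean - radius

def lemma_gap(w1, w2):
    """Largest gap between the eigenvalues of X = (w1 w2^T + w2 w1^T)/2 and the closed forms."""
    a = w1[0] * w2[0]
    b = 0.5 * (w1[0] * w2[1] + w1[1] * w2[0])
    d = w1[1] * w2[1]
    lam1, lam2 = sym_eigs(a, b, d)
    inner = w1[0] * w2[0] + w1[1] * w2[1]
    norms = math.hypot(*w1) * math.hypot(*w2)
    gap_eigs = max(abs(lam1 - 0.5 * (inner + norms)), abs(lam2 - 0.5 * (inner - norms)))
    gap_nuclear = abs(abs(lam1) + abs(lam2) - norms)  # ||X||_1 = ||w1|| ||w2||
    return max(gap_eigs, gap_nuclear)

rng = random.Random(1)
worst = max(
    lemma_gap((rng.gauss(0, 1), rng.gauss(0, 1)), (rng.gauss(0, 1), rng.gauss(0, 1)))
    for _ in range(1000)
)
print(worst)
```

For instance, $w_1 = (1,0)$ and $w_2 = (0,1)$ give eigenvalues $\pm\frac12$ and nuclear norm $1 = \|w_1\|\|w_2\|$.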

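Now we analyze $V(x,y)$. Parallel to the study of $U(x,y)$ we consider the following quantities:

$$\mu_\varepsilon(x) := \inf_{\{y \,:\, d(x,y) \le \varepsilon\}} V(x,y), \qquad \mu(x) := \liminf_{\{y \,:\, d(x,y) \to 0\}} V(x,y) = \liminf_{\varepsilon \to 0} \mu_\varepsilon(x),$$

$$\mu_0 := \inf_{x} \mu(x), \qquad \mu_\infty := \inf_{d(x,y)>0} V(x,y),$$

as well as the upper bound $\sup_{d_1(x,y)>0} V(x,y)$. By (4.17) we have $|\langle f_j,x\rangle|^2 - |\langle f_j,y\rangle|^2 = \langle F_j, X\rangle$ where $F_j = f_j f_j^T$ and $X = xx^T - yy^T$. Hence

$$V^2(x,y) = \frac{\sum_{j=1}^m |\langle F_j, X\rangle|^2}{\|X\|_1^2}.$$

Setting $w_1 = x - y$ and $w_2 = x + y$ and applying Lemma 4.4 we obtain

$$V^2(x,y) = \frac{\sum_{j=1}^m |\langle f_j,w_1\rangle|^2\, |\langle f_j,w_2\rangle|^2}{\|w_1\|^2\, \|w_2\|^2}. \tag{4.18}$$

We can immediately obtain the upper bound:

$$V(x,y) \le \Bigl(\sup_{\|e_1\|=1,\,\|e_2\|=1} \sum_{j=1}^m |\langle f_j,e_1\rangle|^2\, |\langle f_j,e_2\rangle|^2\Bigr)^{1/2} = \Bigl(\max_{\|e\|=1} \sum_{j=1}^m |\langle f_j,e\rangle|^4\Bigr)^{1/2} =: \Lambda_{\mathcal F}^2,$$

where $\Lambda_{\mathcal F}$ denotes the operator norm of the linear analysis operator $T : H \to \mathbb{R}^m$, $T(x) = (\langle x, f_k\rangle)_{k=1}^m$, defined between the Euclidean space $H = \mathbb{R}^n$ and the Banach space $\mathbb{R}^m$ endowed with the $l^4$-norm:

$$\Lambda_{\mathcal F} = \max_{\|x\|=1} \Bigl(\sum_{k=1}^m |\langle x, f_k\rangle|^4\Bigr)^{1/4}. \tag{4.19}$$

Note also that

$$\Lambda_{\mathcal F}^4 = \max_{\|x\|=1} \lambda_{\max}(R(x)),$$

where $R(x)$ was defined in (2.6). An immediate bound is $\Lambda_{\mathcal F}^2 \le \sqrt{B}\, \max_k \|f_k\|$ with $B$ the upper frame bound of $\mathcal{F}$.

Fix $x \in \mathbb{R}^n$ and let $d(x,y) \to 0$. Then either $y \to x$ or $y \to -x$. Without loss of generality we assume that $y \to x$. Thus $w_1 = x - y \to 0$ and $w_2 = x + y \to 2x$. However $w_1/\|w_1\|$ can be any unit vector. Thus

$$\mu^2(x) = \frac{1}{\|x\|^2} \inf_{\|u\|=1} \sum_{j=1}^m |\langle f_j,x\rangle|^2\, |\langle f_j,u\rangle|^2 = \frac{1}{\|x\|^2} \inf_{\|u\|=1} \langle R(x)u, u\rangle = \frac{\lambda_{\min}(R(x))}{\|x\|^2},$$

where $R(x)$ was introduced in (2.6). Thus we obtain

$$\mu(x) = \frac{1}{\|x\|} \sqrt{\lambda_{\min}(R(x))}, \qquad \mu_0 = \min_{\|u\|=1} \sqrt{\lambda_{\min}(R(u))}.$$

On the other hand note

$$\mu_\infty^2 = \inf_{d(x,y)>0} V^2(x,y) = \inf_{\|e_1\|=\|e_2\|=1} \sum_{j=1}^m |\langle f_j,e_1\rangle|^2\, |\langle f_j,e_2\rangle|^2 = \inf_{\|e\|=1} \lambda_{\min}(R(e)) = a_0,$$

where $a_0$ was introduced in (2.4). Thus we proved: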

Theorem 4.5. Assume the frame $\mathcal{F}$ is phase retrievable. Then

$$\mu(x) = \frac{1}{\|x\|} \sqrt{\lambda_{\min}(R(x))}, \tag{4.20}$$

$$\mu_\infty = \mu_0 = \min_{u \,:\, \|u\|=1} \sqrt{\lambda_{\min}(R(u))} = \sqrt{a_0}. \tag{4.21}$$

Furthermore $\alpha_{\mathcal F}^2$ is bi-Lipschitz with upper Lipschitz bound $\Lambda_{\mathcal F}^2$ and lower Lipschitz bound $\sqrt{a_0}$:

$$\sqrt{a_0}\, d_1(x,y) \le \|\alpha_{\mathcal F}^2(x) - \alpha_{\mathcal F}^2(y)\| \le \Lambda_{\mathcal F}^2\, d_1(x,y),$$

where $a_0$ is the same positive constant used in Theorems 2.1 and 3.1, and $\Lambda_{\mathcal F}$ is the norm of the analysis operator defined between the Euclidean space $H$ and $l^4(\{1,2,\dots,m\})$.

Remark. Note that the distance $d(\cdot,\cdot)$ is not equivalent to $d_1(\cdot,\cdot)$. Theorem 4.5 now also implies that $\alpha_{\mathcal F}^2$ is not bi-Lipschitz with respect to the distance $d(\cdot,\cdot)$ on $\hat{H}$. This fact was pointed out in [9].
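The bi-Lipschitz statement of Theorem 4.5 can be probed numerically on a small frame. The sketch below samples random pairs $(x, y)$ in $\mathbb{R}^2$, computes $V(x,y)$ directly from its definition with $d_1$ evaluated as the nuclear norm of $xx^T - yy^T$, and compares against $\sqrt{a_0}$ and $\Lambda_{\mathcal F}^2$. The frame and the two constants ($a_0 = 0.5$ and $\max_{\|e\|=1}\sum_j \langle f_j,e\rangle^4 = 4.5$, both obtainable by hand for this frame) are assumptions of this illustration.

```python
import math
import random

FRAME = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]

def alpha_sq(x):
    """Squared measurements alpha_F^2(x) = (<f_j, x>^2)_j."""
    return [(f[0] * x[0] + f[1] * x[1]) ** 2 for f in FRAME]

def d1(x, y):
    """Nuclear norm of the 2x2 symmetric matrix x x^T - y y^T."""
    a = x[0] * x[0] - y[0] * y[0]
    b = x[0] * x[1] - y[0] * y[1]
    d = x[1] * x[1] - y[1] * y[1]
    mean, radius = 0.5 * (a + d), math.hypot(0.5 * (a - d), b)
    return abs(mean + radius) + abs(mean - radius)

def v_ratio(x, y):
    """V(x, y) = ||alpha^2(x) - alpha^2(y)|| / d_1(x, y)."""
    num = math.sqrt(sum((p - q) ** 2 for p, q in zip(alpha_sq(x), alpha_sq(y))))
    return num / d1(x, y)

LOWER = math.sqrt(0.5)  # sqrt(a_0), with a_0 = 0.5 assumed for this frame
UPPER = math.sqrt(4.5)  # Lambda_F^2, with max_e sum_j <f_j,e>^4 = 4.5 assumed

rng = random.Random(2)
ratios = [
    v_ratio((rng.gauss(0, 1), rng.gauss(0, 1)), (rng.gauss(0, 1), rng.gauss(0, 1)))
    for _ in range(2000)
]
print(min(ratios), max(ratios))
```

All sampled ratios should fall between the two constants. Random sampling only explores the bounds; it does not certify that they are attained.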

5.  Robustness  and  size  of  redundancy 


Previous sections establish results on the robustness of phaseless reconstruction for the worst case scenario. A natural question is to ask: can “reasonable” robustness be achieved for a given frame, and in particular with a small number of samples? We shall examine how $q_\varepsilon$ scales as the dimension $n$ increases.

Consider the case where $m = 2n - 1$. This is the minimal redundancy required for phaseless reconstruction. In this case any frame $\mathcal{F}$ would have $\Delta = \omega$. Hence we have $\min\{1/\omega, 1/\varepsilon\} \le q_\varepsilon \le 1/\omega$. The stability of the reconstruction is thus mostly controlled by the size of $1/\omega$. The question is: how big is $\omega$, especially as $n$ increases?

Assume that the frame elements of $\mathcal{F}$ are all bounded by $L$, $\|f_j\| \le L$ for all $f_j \in \mathcal{F}$. Consider the $n + 1$ elements $\{f_j : j = 1, \dots, n+1\}$. They are linearly dependent so we can find $c_j \in \mathbb{R}$, not all zero, such that $\sum_{j=1}^{n+1} c_j f_j = 0$. Without loss of generality we may assume $|c_{n+1}| = \min\{|c_j|\}$. Set $v = [c_1, c_2, \dots, c_n]^T$. Let $G = [f_1, \dots, f_n]$. Then $Gv = \sum_{j=1}^{n} c_j f_j = -c_{n+1} f_{n+1}$. Now all $|c_j| \ge |c_{n+1}|$ so $\|v\| \ge \sqrt{n}\,|c_{n+1}|$. Thus

$$\|Gv\| = |c_{n+1}|\,\|f_{n+1}\| \le \frac{L}{\sqrt{n}}\,\|v\|.$$

It follows that $\sigma_n(G) \le \frac{L}{\sqrt{n}}$, and hence

$$\omega \le \frac{L}{\sqrt{n}}. \tag{5.1}$$
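The argument leading to (5.1) can be traced numerically in the smallest case $n = 2$. The sketch below takes three random vectors in $\mathbb{R}^2$, builds an explicit linear dependency from scalar cross products, drops the vector with the smallest coefficient, and checks that the remaining $2 \times 2$ matrix $G$ satisfies $\sigma_n(G) \le L/\sqrt{n}$. The restriction to $n = 2$, the Gaussian test vectors and the helper names are assumptions made to keep the example self-contained.

```python
import math
import random

def cross(a, b):
    """Scalar cross product of two vectors in R^2."""
    return a[0] * b[1] - a[1] * b[0]

def sigma_min(g1, g2):
    """Smallest singular value of the 2x2 matrix G = [g1, g2] (columns)."""
    a = g1[0] ** 2 + g2[0] ** 2          # entries of G G^T
    b = g1[0] * g1[1] + g2[0] * g2[1]
    d = g1[1] ** 2 + g2[1] ** 2
    lam = 0.5 * (a + d) - math.hypot(0.5 * (a - d), b)
    return math.sqrt(max(lam, 0.0))

def bound_holds(f, tol=1e-12):
    """Check sigma_n(G) <= L / sqrt(n) for n = 2 and three vectors f[0], f[1], f[2]."""
    # Dependency coefficients: c[0] f[0] + c[1] f[1] + c[2] f[2] = 0 in R^2.
    c = [cross(f[1], f[2]), cross(f[2], f[0]), cross(f[0], f[1])]
    # Drop the vector with the smallest |c_j|; it plays the role of f_{n+1}.
    drop = min(range(3), key=lambda j: abs(c[j]))
    g1, g2 = [f[j] for j in range(3) if j != drop]
    L = max(math.hypot(*v) for v in f)
    return sigma_min(g1, g2) <= L / math.sqrt(2) + tol

rng = random.Random(3)
ok = all(
    bound_holds([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(3)])
    for _ in range(1000)
)
print(ok)
```

Dropping the vector with the smallest coefficient mirrors the "without loss of generality" step in the text; without it, two orthogonal vectors of norm $L$ would give $\sigma_n(G) = L$.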


Note that here we have considered only the first $n + 1$ vectors of the frame $\mathcal{F}$. The actual value of $\omega$ will likely decay much faster as $n$ increases. In a preliminary work we are able to establish the bound $\omega \le CL/n^{3/2}$ where $C$ is independent of $n$ [18]. But even this estimate is likely far from optimal.

Conjecture 5.1. Let $m = 2n - 1$ and $\|f_j\| \le L$ for all $f_j \in \mathcal{F}$. Then there exist constants $C > 0$ and $0 < \beta < 1$ independent of $n$ such that

$$\omega \le C L \beta^n.$$

A related problem is as follows: Consider an $n \times (n + k)$ matrix $F = [g_1, g_2, \dots, g_{n+k}]$. Let $\tau = \min\{\sigma_n(F_S) : S \subset \{1,\dots,n+k\},\ |S| = n\}$. Assume that all $\|g_j\| \le 1$. How large can $\tau$ be? For $k = 1$ we have already seen that it is bounded from above by $C/\sqrt{n}$. The preliminary work [18] shows that for $k = 1$ it is bounded from above by $C/n^{3/2}$.

Conjecture 5.2. There exists a constant $C = C(k, n)$ such that


where $C(k,n) = O_k(\log^{q_k} n)$ for some $q_k > 0$. Here $O_k$ denotes the dependence on $k$.

Thus in the minimal setting with $m = 2n - 1$ it is impossible to achieve scale independent stability for phaseless reconstruction. The same arguments can be used to show that even when $m = 2n + k_0$ for some fixed $k_0$ scale independent stability is not possible. A natural question is whether scale independent stability is possible when we increase the redundancy of the frame. As it turns out this is possible via a recent work by Wang [17]. More precisely, the following result follows from the main results in [17]:

Theorem 5.3. Let $r_0 > 2$ and let $F = \frac{1}{\sqrt{m}} G$ where $G$ is an $n \times m$ random matrix whose elements are i.i.d. normal $N(0,1)$ random variables such that $m/n = r_0$. Then there exist constants $0 < \Delta_0 \le \omega_0$ dependent only on $r_0$ and not on $n$ such that with high probability we have

$$\Delta \ge \Delta_0, \qquad \omega \ge \omega_0.$$

Proof. Theorem 1.1 and Theorem 3.1 of [17] prove the following result: Let $\lambda > \delta > 1$ be fixed. Assume that $A = \frac{1}{\sqrt{N}} B$ where $B$ is an $n \times N$ random Gaussian matrix with i.i.d. $N(0,1)$ entries such that $N/n = \lambda$. Then there exists a constant $c > 0$ depending only on $\tau_0$, $\lambda$ and $\delta$ such that

$$\min_{S \subset \{1,\dots,N\},\ |S| \ge \delta n} \sigma_n(A_S) \ge c$$

with probability at least $1 - 3e^{-\tau_0 n}$. The value $c$ was explicitly estimated in terms of $\tau_0$, $\lambda$ and $\delta$ in the proof of Theorem 3.1 in [17].

The theorem now readily follows. Observe that because $r_0 > 2$, in the expression for $\Delta$ we may choose $\lambda = r_0$, $\delta = \frac{r_0}{2} > 1$ and clearly we have

$$\Delta \ge \min_{S \subset \{1,\dots,N\},\ |S| \ge \delta n} \sigma_n(F_S) \ge \Delta_0,$$

for some $\Delta_0 > 0$ independent of $n$. For $\omega$ we may choose $\lambda = r_0$ and $\delta = r_0 - 1 > 1$. Again the theorem of [17] implies that

$$\omega \ge \min_{S \subset \{1,\dots,N\},\ |S| \ge \delta n} \sigma_n(F_S) \ge \omega_0.$$

□


In the theorem the values $\Delta_0$ and $\omega_0$ can be estimated explicitly using the estimates in [17]. Here with high probability is in the standard sense that the probability is at least $1 - c_0 e^{-\beta n}$ for some $c_0, \beta > 0$. Thus scale independent stable phaseless reconstruction is possible whenever the redundancy is greater than $2 + \delta$, $\delta > 0$, at least for random Gaussian matrices.

Fig. 1. Plots of sample medians of $\Delta$ and $\omega$ (left plot) and $\Delta$ and $\sigma$, $\sqrt{2}\,\sigma$ (right plot) for randomly generated frames of size $m = 2n$.

Fig. 2. Plots of largest sample value of $\Delta$ and $\omega$ (left plot) and $\Delta$ and $\sigma$, $\sqrt{2}\,\sigma$ (right plot) for randomly generated frames of size $m = 2n$.

6.  Numerical  examples 

In  this  section  we  present  two  numerical  studies  of  the  stability  bounds  derived  earlier. 

1. First consider the following setup. For each $n$ between 2 and 14 we generate 100 realizations of random frames of $m = 2n$ vectors where each entry is i.i.d. normally distributed with zero mean and unit variance. For each realization we compute $\Delta$, $\omega$ and $\sigma$. For each fixed $n$ we compute the sample median, the largest sample value and the smallest sample value of these random variables.

Fig. 1 contains the plots of sample medians of $\Delta$, $\omega$ and $\sigma$ for each value of $n$. The left plot contains only $\Delta$ (the lower Lipschitz constant) and $\omega$ (the lower Lipschitz constant for small perturbations); the right plot contains $\Delta$ and the two bounds $\sigma$ and $\sqrt{2}\,\sigma$ as obtained in [9]. Similar statistics are plotted in Fig. 2 where sample medians are replaced by the largest sample values, and in Fig. 3 where sample medians are replaced by smallest sample values.

Note there is about 1-2 orders of magnitude spread between the largest and the smallest sample value of these random variables.

Fig. 3. Plots of smallest sample value of $\Delta$ and $\omega$ (left plot) and $\Delta$ and $\sigma$, $\sqrt{2}\,\sigma$ (right plot) for randomly generated frames of size $m = 2n$.


2. Next we consider the following specific example. $H = \mathbb{R}^2$, $m = 4$ and the frame containing

$$f_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad f_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad f_3 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad f_4 = \begin{bmatrix} 1 \\ -1 \end{bmatrix},$$

which is a tight frame of bounds $A = B = 3$. The frame is full spark hence phase retrievable. The bounds $\Delta$ and $\omega$ defined by (4.4) and (4.5) are given by

$$\Delta = \sqrt{3 - \sqrt{5}} = 0.874032, \qquad \omega = 1,$$

which correspond to choices $S = \{1,3\}$ and $S = \{1,2,3\}$, respectively. The parameter $\sigma$ introduced in (4.16) is given by

$$\sigma = \sqrt{\frac{3 - \sqrt{5}}{2}} = 0.618034$$

and corresponds to $S = \{1,3\}$. The parameter $\tau$ introduced in the statement of Theorem 4.2 is given by the same expression, $\tau = \sigma = \sqrt{\frac{3 - \sqrt{5}}{2}} = 0.618034$, and corresponds to the same selection $S = \{1,3\}$.
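The four constants of this example can be reproduced by brute force over all index subsets. The sketch below does this using only closed-form eigenvalues of $2 \times 2$ Gram matrices; the rank tolerance `TOL` and the function names are assumptions of this illustration, and indices are 0-based in the code (so $S = \{1,3\}$ in the text is `{0, 2}` here).

```python
import math
from itertools import combinations

FRAME = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]  # f_1, ..., f_4

def sigma_n(indices):
    """sigma_2(F_S): square root of the smallest eigenvalue of F_S F_S^T (0 for empty S)."""
    a = sum(FRAME[j][0] ** 2 for j in indices)
    b = sum(FRAME[j][0] * FRAME[j][1] for j in indices)
    d = sum(FRAME[j][1] ** 2 for j in indices)
    lam = 0.5 * (a + d) - math.hypot(0.5 * (a - d), b)
    return math.sqrt(max(lam, 0.0))

m = len(FRAME)
everything = set(range(m))
subsets = [set(c) for r in range(m + 1) for c in combinations(range(m), r)]
TOL = 1e-12  # singular values below TOL are treated as rank deficiency

# (4.4): Delta = min_S sqrt(A[S] + A[S^c]) with A[S] = sigma_n(F_S)^2.
delta = min(math.sqrt(sigma_n(S) ** 2 + sigma_n(everything - S) ** 2) for S in subsets)
# (4.5): omega = min of sigma_n(F_S) over S whose complement is rank deficient.
omega = min(sigma_n(S) for S in subsets if sigma_n(everything - S) < TOL)
# (4.16): sigma = min_S max(sigma_n(F_S), sigma_n(F_{S^c})).
sigma = min(max(sigma_n(S), sigma_n(everything - S)) for S in subsets)
# Theorem 4.2: tau = min of sigma_n(F_S) over full-rank S.
tau = min(sigma_n(S) for S in subsets if sigma_n(S) > TOL)

print(f"Delta={delta:.6f} omega={omega:.6f} sigma={sigma:.6f} tau={tau:.6f}")
```

The enumeration is exponential in $m$, which is harmless for $m = 4$ but limits this approach to small frames.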

Tedious algebra can provide closed form expressions for $\rho_\varepsilon(x)$ as a function of the radius $\varepsilon$. Because of the scaling relation $\rho_{c\varepsilon}(cx) = \rho_\varepsilon(x)$ for all $c > 0$ it follows that only the direction of $x$ describes this function. For instance for $x^{(1)} = (1,0)$ we obtain the following expression:

P£(x(1))  =  ' 

For $x^{(2)} = (1,1)$ we obtain:

Pe(x(2) )  = 

The  plots  of  these  two  functions  are  depicted  in  Fi{2.5S. 
Fig. 4. Plots of $\rho_\varepsilon(x^{(1)})$ (left plot) and $\rho_\varepsilon(x^{(2)})$ (right plot) as function of radius $\varepsilon$. The red circle is at $\sqrt{A} = \sqrt{3}$. The horizontal
dotted line is the lower bound $\Delta = 0.874$. (For interpretation of the references to color in this figure, the reader is referred to the
web version of this article.)


Fig. 5. Plots of $\rho_\varepsilon(x^{(3)})$ as function of radius $\varepsilon$. The red circle is at $\sqrt{A} = \sqrt{3}$. The horizontal dotted line is the lower bound
$\Delta = 0.874$. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

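Following the proof of Theorem 4.3, it follows that the critical direction that achieves the lower bound $\Delta$ is
given by $x = u + v$, where $u$ and $v$ are the two normalized eigenvectors associated to the lowest eigenvalue
(i.e. the lower frame bound) for $\{f_1, f_3\}$ and $\{f_2, f_4\}$ respectively. The lowest eigenvalue is given by
$(3 - \sqrt{5})/2$ and the eigenvectors are

$$u = \begin{bmatrix} -\sqrt{\dfrac{2}{5+\sqrt{5}}} \\[2ex] \dfrac{1+\sqrt{5}}{\sqrt{2(5+\sqrt{5})}} \end{bmatrix}, \quad v = \begin{bmatrix} -\sqrt{\dfrac{2}{5-\sqrt{5}}} \\[2ex] \dfrac{1-\sqrt{5}}{\sqrt{2(5-\sqrt{5})}} \end{bmatrix},$$

and thus the critical vector is

$$x^{(3)} = u + v = \begin{bmatrix} -\sqrt{\dfrac{2}{5+\sqrt{5}}} - \sqrt{\dfrac{2}{5-\sqrt{5}}} \\[2ex] \dfrac{1+\sqrt{5}}{\sqrt{2(5+\sqrt{5})}} + \dfrac{1-\sqrt{5}}{\sqrt{2(5-\sqrt{5})}} \end{bmatrix} = \begin{bmatrix} -1.3764 \\ 0.3249 \end{bmatrix}.$$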
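
The numbers quoted in this example can be checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the original derivation: it assumes that $\sigma$ is the square root of the lowest eigenvalue of the frame operator of $\{f_1, f_3\}$ and that $\Delta$ is the square root of the sum of the lowest eigenvalues for $\{f_1, f_3\}$ and $\{f_2, f_4\}$, which is consistent with the values above but relies on formulas (4.4) and (4.16) that are not reproduced here. The function names are hypothetical.

```python
import math

def frame_operator(vectors):
    """Entries (a, b, d) of the 2x2 frame operator sum_k f_k f_k^T = [[a, b], [b, d]]."""
    a = sum(f[0] * f[0] for f in vectors)
    b = sum(f[0] * f[1] for f in vectors)
    d = sum(f[1] * f[1] for f in vectors)
    return a, b, d

def lowest_eigenpair(a, b, d):
    """Lowest eigenvalue and a unit eigenvector of [[a, b], [b, d]] (requires b != 0)."""
    lam = (a + d) / 2.0 - math.hypot((a - d) / 2.0, b)
    # (a - lam) * x + b * y = 0, so (b, lam - a) is an eigenvector
    x, y = b, lam - a
    norm = math.hypot(x, y)
    x, y = x / norm, y / norm
    if x > 0:  # sign convention: first component negative, as in the text
        x, y = -x, -y
    return lam, (x, y)

# Tight frame check: the full frame operator should be 3 times the identity.
full = frame_operator([(1, 0), (0, 1), (1, 1), (1, -1)])

lam_u, u = lowest_eigenpair(*frame_operator([(1, 0), (1, 1)]))    # {f1, f3}
lam_v, v = lowest_eigenpair(*frame_operator([(0, 1), (1, -1)]))   # {f2, f4}

sigma = math.sqrt(lam_u)            # assumed reading of (4.16)
delta = math.sqrt(lam_u + lam_v)    # assumed reading of (4.4)
critical = (u[0] + v[0], u[1] + v[1])
print(full, round(sigma, 6), round(delta, 6), critical)
```

Under these assumptions the script reproduces the frame bound 3, the values 0.618034 and 0.874032, and the critical vector $(-1.3764, 0.3249)$ to the printed precision.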

The function $\rho_\varepsilon(x^{(3)})$ is computed numerically and is plotted in Fig. 5. For reference we pictured a circle
at $\sqrt{A} = \sqrt{3}$ and we plotted a dotted line at $\rho_\infty = 0.874$. We remark that in all three cases the limit
$\lim_{\varepsilon \to 0} \rho_\varepsilon(x) = \sqrt{A} = \rho_0$, as predicted by Theorem 4.3. Furthermore, $\min_{\varepsilon > 0, x} \rho_\varepsilon(x) = \Delta = \rho_\infty$, as proved
in the same Theorem 4.3.
MULTIPLE AUTHORS DETECTION: A QUANTITATIVE ANALYSIS OF

DREAM  OF  THE  RED  CHAMBER 

XIANFENG  HU,  YANG  WANG,  AND  QIANG  WU 


Abstract. Inspired by the authorship controversy of Dream of the Red Chamber and the application
of machine learning in the study of literary stylometry, we develop a rigorous new method for
the mathematical analysis of authorship by testing for a so-called chrono-divide in writing styles.
Our method incorporates some of the latest advances in the study of authorship attribution,
particularly techniques from support vector machines. By introducing the notion of relative frequency
as a feature ranking metric our method proves to be highly effective and robust.

Applying our method to the Cheng-Gao version of Dream of the Red Chamber has led to convincing
if not irrefutable evidence that the first 80 chapters and the last 40 chapters of the book
were written by two different authors. Furthermore, our analysis has unexpectedly provided strong
support to the hypothesis that Chapter 67 was not the work of Cao Xueqin either.

We have also applied our method to the other three Great Classical Novels in Chinese. As expected,
no chrono-divides have been found. This provides further evidence of the robustness of our method.


1. Introduction

Dream of the Red Chamber (红楼梦) by Cao Xueqin (曹雪芹) is one of China's Four Great
Classical Novels. For more than one and a half centuries it has been widely acknowledged as
the greatest literary masterpiece ever written in the history of Chinese literature. The novel is
remarkable for its vividly detailed descriptions of life in 18th-century China during the Qing
Dynasty and the psychological affairs of its large cast of characters. There is a vast literature in
Redology, a term devoted exclusively to the study of Dream of the Red Chamber, that touches upon
virtually all aspects of the book one can imagine, from the analysis of even minor characters in the
book to in-depth literary study of the book. Much of the scope of Redology is outside the focus of
this paper.

The original manuscript of Dream of the Red Chamber began to circulate in the year 1759. The
problems concerning the text and authorship of the novel are extremely complex and have remained
very controversial even today, and they remain an important part of Redology studies. Cao, who
died in 1763-4, did not live to publish his novel. Only hand-copied manuscripts, some 80 chapters,
had been circulating. It was not until 1791 that the first printed version was published, which was
put together by Cheng Weiyuan (程伟元) and Gao E (高鹗) and was known as the Cheng-Gao
version. The Cheng-Gao version has 120 chapters, 40 more than the various hand-copied versions
that were circulating at the time. Cheng and Gao claimed that this “complete version” was based
on previously unknown working papers of Cao, which they obtained through different channels.
It was these last 40 chapters that were the subject of intense debate and scrutiny. Most modern
scholars believe that these 40 chapters were not written by Cao. Many view those late additions as
the work of Gao E. Some critics, such as the renowned scholar Hu Shi (胡适), called them forgeries
perpetrated by Gao, while others believe that Gao was duped into taking someone else's forgery as
an original work. There is, however, a minority of critics who view the last 40 chapters as genuine.

Key words and phrases. Dream of the Red Chamber, Cao Xueqin, Redology, machine learning, support vector
machine (SVM), recursive feature elimination (RFE), literary stylometry, authorship authentication, chrono-divide.

The analysis of the authenticity of the last 40 chapters has largely been based on the examination
of plots and prose style by Redology scholars and connoisseurs. For example, many scholars consider
the plotting and prose of the last 40 chapters to be inferior to those of the first 80 chapters. Others have
argued that the fates of many characters in the end were inconsistent with what earlier chapters
had foreshadowed. A natural question is whether a mathematical stylometric analysis of the
book can shed some light on this authenticity debate.

The problem of style quantification and authorship attribution in literature goes back at least
to 1854, when the English mathematician Augustus De Morgan [7], in a letter to a clergyman
on the subject of Gospel authorship, suggested that the lengths of words might be used
to differentiate authors. In 1897 the term stylometry was coined by the historian of philosophy
Wincenty Lutosławski as a catch-all for a collection of statistical techniques applied to questions
of authorship and evolution of style in the literary arts (see e.g. [12]). Today, literary stylometry
is a well developed and highly interdisciplinary research area that draws extensively from a number
of disciplines such as mathematics and statistics, literature and linguistics, computer science,
information theory and others. It is a central area of research in statistical learning (see e.g. [9]).
A popular classic technique for stylometric analysis of authorship involves comparing frequencies
of the so-called function words, a class of words that in general have little content meaning, but
instead serve to express grammatical relationships with other words within a sentence. Although
this technique is still widely used today, the field of literary stylometry has seen impressive advances
in recent years, with more and more new and sophisticated mathematical techniques as well
as software being developed. We shall not focus on these advances here. Instead we refer all interested
readers to the excellent survey articles by Juola [10] and Stamatatos [14] for a comprehensive
discussion of the latest advances in the field.

Although  there  is  a  vast  Redology  literature  going  back  over  100  years,  the  number  of  studies 
of  the  book  based  on  mathematical  and  statistical  techniques  is  surprisingly  small,  particularly  in 
view  of  the  fact  that  such  techniques  have  been  used  widely  in  the  West  for  settling  authorship 
questions.  Among  the  notable  efforts,  Cao  [3]  meticulously  broke  down  a  number  of  function 
characters  and  words  into  classes  according  to  their  functions.  By  analyzing  their  frequencies  in 
the  first  40  chapters,  the  middle  40  chapters  and  the  last  40  chapters,  Cao  concluded  that  the  first 
80  chapters  and  the  last  40  chapters  were  written  by  different  authors.  Zhang  &  Liu  [2]  examined 
the  occurrence  of  240  characters  in  the  book  that  are  outside  the  GB2312  encoding  system,  and 
found  that  210  of  them  have  appeared  exclusively  in  the  first  80  chapters  while  only  20  of  them 
have  appeared  exclusively  in  the  last  40  chapters.  This  led  to  the  same  conclusion  by  the  authors. 
Yue [1] studied the authorship by combining both historical knowledge and statistical tools. In the
study Yue tested two hypotheses: that the last 40 chapters were not written by the same author
as the first 80, or that they were. His study focused on the frequencies of 5 particular
function characters, the proportion of texts to poems in each chapter, and a few other stylometric
peculiarities such as the number of sentences ending with the character “Ma”. Using various
statistical techniques, the comparisons led the paper to conclude that it is unlikely that
the first 80 chapters and the last 40 chapters were written by the same author. At the same
time, using historical knowledge about the book and the original author Cao Xueqin, the paper also
speculated that it was not likely that the last 40 chapters were created entirely by a single different
author such as Gao E. In the opposite direction, the studies of Chan [6] and Li & Li [11] concluded
that  the  entire  book  was  likely  written  by  a  single  author.  The  study  [11]  focused  on  the  usage 
of  functional  characters  while  [6]  examined  the  usage  of  some  eighty  thousand  characters.  Both 
studies  tabulated  the  frequencies  of  the  selected  characters,  which  led  to  a  frequency  vector  for 
each  of  the  first  40  chapters,  the  middle  40  chapters  and  the  last  40  chapters.  The  correlations 
of  these  frequency  vectors  were  computed.  In  [11]  the  correlations  were  found  to  be  large  enough 
for  the  authors  to  conclude  that  the  entire  120  chapters  of  the  book  were  written  by  the  same 
author. In [6] a fourth frequency vector using parts of the book The Gallant Ones
was added for comparison. The author found significantly higher correlations among the first three
frequency vectors from chapters of Dream of the Red Chamber than the correlations between the
fourth frequency vector and the first three. This fact formed the basis of the conclusion by the
author that all 120 chapters were written by a single author. A different conclusion was reached
by Li [4]. By analyzing the frequencies of 47 functional characters and applying several statistical
tests, the author conjectured that the last 40 chapters were put together by Gao E using unedited
and unfinished manuscripts by Cao Xueqin and his family members.

Although some of these aforementioned studies are impressive in their scope, conspicuously missing
from the Redology literature are studies based on the latest advances in literary stylometry,
particularly some of the new and powerful methods from machine learning theory. While comparing
the frequencies of function characters and words is clearly a viable way to analyze the authorship
question, care needs to be taken to account for random fluctuations of these frequencies, especially
when some of the function characters and words used for comparison have limited occurrences
overall in the book and sometimes do not occur at all in some chapters. None of the aforementioned studies
employed cross validation to address random fluctuations. We have substantial reservations about
drawing  conclusions  from  correlations  alone  as  in  the  studies  of  Chan  [6]  and  Li  &  Li  [11],  because 
the  differentiating  power  of  any  single  variable  such  as  correlation  is  rather  limited.  It  would  be 
interesting  to  see  a  more  comprehensive  study  of  correlations  on  a  large  corpus  of  texts  in  Chinese 
to  determine  its  effectiveness  as  a  metric  for  authorship  attribution,  something  the  authors  failed 
to do in both studies. The use of the book The Gallant Ones in [6] for benchmark comparison
is particularly curious to us, especially considering that the author did not limit the comparison to function
characters. The two books are of two different genres and are different in their respective background
settings. Considering these differences and the fact that The Gallant Ones is known not to
be  written  by  Cao  Xueqin,  it  would  be  a  shock  if  the  correlation  between  the  last  40  chapters  of 
Dream  of  the  Red  Chamber  and  the  first  80  chapters  is  not  higher  than  the  correlation  between  the 
last  40  chapters  and  The  Gallant  Ones.  It  is  possible  that  the  correlation  computed  in  [6]  tells  more 
about the genre than the authorship of the books. Again, without extensive evidence that, under the
same technique, the correlation between two bodies of text written by different authors is generally
low even when the plots are closely related, the argument made in [6] is unconvincing at best.

The objective of this paper is to present a rigorous stylometric analysis of Dream of the Red Chamber
that incorporates some of the latest advances in the study of authorship attribution, particularly
techniques from the theory of machine learning. To minimize the impact of random fluctuations
we have meticulously followed well established protocols in selecting significant features by proper
randomization of training and testing samples.
We shall detail our methodology in the next section, including feature construction and selection
techniques in machine learning. In Section 3 we apply our approach to the study of the authorship of
Dream of the Red Chamber and present the experimental results. In Section 4 we use our approach
to analyze the other three Great Classical Novels. Finally, in Section 5 we present some additional
comments and our conclusions.


2. Chrono-divide and Methodology

The main idea behind statistically or computationally supported authorship attribution is that
by measuring some textual features we can distinguish between texts written by different authors.
Nearly a thousand different measures, including sentence length, word length, word frequencies,
character frequencies, and vocabulary richness functions, have been proposed over the
years [13]. Some of these measures, such as frequencies of function words, have proven effective while
others, such as length of words, have proven less effective [10]. The field of literary stylometry has
seen impressive advances over the years, and has become an increasingly important research field
in the digital age with the explosion of texts online.

This  paper  focuses  on  a  particular  class  of  authorship  controversies,  in  which  there  is  a  suspected 
change  of  authorship  at  some  point  of  a  book.  In  other  words,  one  suspects  that  the  first  X  chapters 
of  a  book  were  written  by  one  author  while  the  remaining  Y  chapters  were  written  by  another. 
Clearly,  the  authorship  controversy  for  Dream  of  the  Red  Chamber  falls  into  this  category.  Since 
no  two  authors  have  exactly  the  same  writing  style,  no  matter  how  similar  they  might  be,  a  book 
written  in  such  a  fashion  would  have  a  stylistic  discontinuity  going  from  Chapter  X  to  Chapter 
X + 1. If we can quantify the styles of the two authors by a stylometric function S(n) (a classifier)
where n denotes chapters, or chronologically ordered samples, of the book in question, this stylistic
discontinuity will appear as a dividing point in the stylometric function S(n) going from n = X to
n = X + 1. Because the samples are ordered by time, we shall call this divide in the stylometric
function S(n) a chrono-divide in style, or simply a chrono-divide. This paper develops a technique
for verifying and detecting chrono-divides in books or other bodies of text. Knowing X and Y, as
is the case with Dream of the Red Chamber, can help validate the conclusion but is not always
necessary for our method. Our method does not apply to any body of text where two authors
share the writing in an interwoven way without a chrono-divide.
The underlying principle of our study is that if a book is in fact written by two authors A and B,
then there should exist a group of features that characterize the difference of their respective styles.
These features will lead to a stylometric function that separates the book into two different classes.
In the rest of the paper we shall use the more conventional term classifier for such a stylometric
function. The foundational principle for literary stylometry is built around finding such classifiers.
Suppose that a chrono-divide in style exists. Then an effective classifier will show a break point
somewhere in the middle of the book, before and after which the classifier gives positive values and
negative values, respectively. Thus in analyzing a book suspected to be written by two authors with
a chrono-divide, one can look for a classifier that gives rise to such a break point. The existence
of such a classifier will provide strong support for the two-author hypothesis. Conversely, if such
a classifier cannot be found then we can confidently reject the hypothesis of two authors with a
chrono-divide.
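
As a concrete illustration of the break-point idea, the following minimal sketch scans a sequence of classifier outputs for the split that best matches the pattern "positive before, negative after". It is an illustrative assumption about how a break point could be located, not a procedure taken from the paper; the function name `best_break_point` and the example scores are hypothetical.

```python
def best_break_point(scores):
    """Return (X, errors) for the split that best fits 'positive up to X, negative after'.

    scores[i] is the classifier output for the (i + 1)-th chronologically ordered
    sample. X is the number of leading samples assigned to the first author, and
    errors counts the samples whose sign disagrees with that split.
    """
    # With X = 0 every sample is expected to be negative, so each positive score is an error.
    errors = sum(1 for s in scores if s > 0)
    best_x, best_errors = 0, errors
    for x in range(1, len(scores) + 1):
        # Sample x moves from the "after" side to the "before" side of the split.
        if scores[x - 1] > 0:
            errors -= 1   # it was wrongly expected to be negative, now it fits
        else:
            errors += 1   # it fitted the "after" side, now it is on the wrong side
        if errors < best_errors:
            best_x, best_errors = x, errors
    return best_x, best_errors


# Hypothetical scores for eight ordered samples with one noisy value at position 3.
print(best_break_point([0.9, 1.2, -0.2, 0.8, 0.6, -0.7, -1.1, -0.9]))
```

A clean chrono-divide shows up as a split with few or no errors; a sequence whose best split still has many errors gives little support to the two-author hypothesis.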

We  use  function  characters  and  words  to  build  and  select  a  group  of  stylometric  features  having 
the  highest  discriminative  power,  and  from  which  we  construct  our  classifier.  We  shall  detail  our 
method  in  the  following  subsections. 

2.1. Initial stylometric feature extraction. Suppose the book in question is suspected to be
written by two authors. For simplicity we shall call the part written by author A Part A and the
part written by author B Part B. In many cases, such as with Dream of the Red Chamber, both
Part A and Part B are known. In some cases, they are not precisely known. However, for books
suspected to have a chrono-divide from authorship change, there is usually a good estimate for
where the divide is. Typically the first few chapters can be confidently attributed to A and the last
few chapters to B.

We begin by choosing a feature set consisting of the kinds of features that might be used consistently
by a single author over a variety of writings. Typically, these features include the frequencies
of words (or characters for books in Chinese) and phrases, the mean and variance of sentence length,
the frequencies of direct speeches and exclamations, and others. In our analysis, to get a better
understanding of an author's writing style, we first find the most frequently used characters and
words in the book. For example, we would find the 500 most frequently used characters in the whole book,
from which we pick out only, say, n function characters. We choose m words (combinations of
characters) among the 300 most frequently used words in the same way. An important point is
that by selecting only function characters and words we obtain a selection of characters and words
that are content independent. This leads to an initial set of features consisting of the frequencies
of the n characters and the m words, plus the mean and variance of sentence length as well as
the frequencies of direct speeches and exclamations. These features will be computed over given
sample texts of the book (e.g. chapters). We normalize the features in the following way: each of
the mean and variance of sentence length and the frequencies of direct speeches,
exclamations, the n characters and the m words is rescaled so that its median in each work of A and B is 1.
For each sample, we now get n + m + 4 features.
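
The feature construction above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: frequencies are taken per character of text, an opening quotation mark is used as a stand-in marker for direct speech, sentences are split on common Western and Chinese end marks, and the median normalization is applied over whatever list of samples is passed in. The function names are hypothetical.

```python
import re
import statistics

def extract_features(text, tokens):
    """Feature vector for one sample text.

    Returns the relative frequency of each entry of `tokens` (the n function
    characters and m function words), followed by the mean and variance of
    sentence length and the frequencies of direct speech and exclamations,
    i.e. len(tokens) + 4 values.
    """
    length = max(len(text), 1)
    freqs = [text.count(tok) / length for tok in tokens]
    # Split on ASCII and full-width sentence-ending punctuation (assumed delimiters).
    sentences = [s.strip() for s in re.split(r"[.!?\u3002\uff01\uff1f]", text)]
    lengths = [len(s) for s in sentences if s] or [0]
    mean_len = statistics.mean(lengths)
    var_len = statistics.pvariance(lengths)
    speech = text.count("\u201c") / length                       # opening quote as speech marker
    exclaim = (text.count("!") + text.count("\uff01")) / length  # ASCII and full-width "!"
    return freqs + [mean_len, var_len, speech, exclaim]

def median_normalize(rows):
    """Rescale each feature column so that its median over the given samples is 1.

    Columns whose median is 0 (e.g. very rare tokens) are left unchanged.
    """
    medians = [statistics.median(col) for col in zip(*rows)]
    return [[v / m if m else v for v, m in zip(row, medians)] for row in rows]
```

In this sketch a column with zero median is left unscaled; how such rare features should be treated is a design choice that the text does not specify.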

2.2. Data preparation. Having constructed the appropriate feature vectors, we build a distinguishing
model through a machine learning algorithm. To do so requires careful data preparation.
Since  we  usually  have  in  hand  only  limited  samples  while  the  number  of  features  will  be  very 
large,  building  a  model  directly  on  the  entire  book  will  easily  lead  to  over-fitting.  To  overcome 
the  over-fitting  problem,  we  use  the  standard  technique  of  separating  the  whole  data  into  samples 
consisting  of  training  data  and  test  data.  Our  model  will  be  established  based  only  on  the  training 
data  while  its  performance  is  tested  over  the  independent  test  data.  If  we  know  Part  A  and  Part 
B  already  then  a  subset  of  each  can  be  designated  as  training  data.  For  books  suspected  to  have 
a  chrono-divide  in  style,  the  training  data  will  consist  of  the  first  few  chapters  and  the  last  few 
chapters.  The  rest  of  the  book  will  be  used  as  test  data. 

In order to obtain more training sets and testing sets we shall chunk the book in question into
smaller pieces of sample texts of relatively uniform size and style. In all the books we have studied,
we have kept the sample texts at least 1000 characters long. In the case of Dream of the Red
Chamber each sample text is a chapter.
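
For a book with a suspected chrono-divide, the split described above amounts to slicing the ordered samples. The helper below is a small sketch of that idea; the name `chrono_split` and its parameters are hypothetical.

```python
def chrono_split(samples, n_head, n_tail):
    """Split chronologically ordered samples into three groups.

    The first n_head samples form the training data for class A, the last
    n_tail samples form the training data for class B, and everything in
    between is held out as test data.
    """
    if n_head + n_tail > len(samples):
        raise ValueError("the two training portions overlap")
    cut = len(samples) - n_tail
    return samples[:n_head], samples[cut:], samples[n_head:cut]


# Example: 120 chapters, first 60 and last 30 used for training.
train_a, train_b, test = chrono_split(list(range(1, 121)), 60, 30)
print(train_a[-1], train_b[0], test[0], test[-1])
```

With 120 chapters, 60 head samples and 30 tail samples, the held-out block is Chapters 61 to 90.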

2.3. Feature subset selection. When we build the authorship analysis model using the training
data only, we do not use all the features (n + m + 4 features). Instead we start out with all of
them, but eventually select a subset of features that achieves the highest discriminative power.
The need for feature subset selection is well understood in high dimensional data analysis in the machine
learning context. First, the number of discriminative features may be small because the number of
features an author uses in a consistently different way from others is usually not very big. Moreover,
the classifier can perform very poorly if too many irrelevant features are included in the model.
In this paper we will use Support Vector Machine Recursive Feature Elimination (SVM-RFE),
introduced in [8], to realize feature selection.
SVM-RFE is a feature ranking method. Given a set of samples we can use a linear SVM to
build a linear classifier, which ranks the importance of the features according to their weights. As
mentioned above, because of the large feature size and small sample size, the classifier might not be
robust. In addition, high correlation between the features may result in small weights for
relevant features. Thus the ranking given directly by the SVM classifier may be inaccurate. In order to refine
the ranking, the least important feature is removed and the linear SVM classifier is retrained. This
new classifier provides a refined ranking for the remaining features. The process is then repeated
until the ranking of all features is refined. This is the SVM-RFE method introduced in [8]. The
idea underlying SVM-RFE is that in each iteration, although the overall ranking may be poor, the
least important feature is seldom a relevant one. By iteratively eliminating the least important
features the new classifiers will become more and more reliable and hence will provide better and
better rankings. In the application to gene expression data analysis SVM-RFE has been shown to
be substantially superior to direct SVM ranking without RFE.
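
The elimination loop can be sketched in a few lines. The code below is a self-contained illustration, not the implementation used in the paper: the linear SVM is trained with plain full-batch subgradient descent on the regularized hinge loss as a stand-in for a proper SVM solver, and the hyperparameters `lam`, `lr` and `epochs` are arbitrary illustrative choices. Only the outer loop (train, drop the feature with the smallest absolute weight, retrain) is the RFE procedure described above.

```python
def train_linear_svm(X, y, active, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on the L2-regularized hinge loss.

    Only the feature indices in `active` are used. Labels must be +1 or -1.
    Returns a dict mapping each active feature index to its weight.
    """
    w = {j: 0.0 for j in active}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = {j: lam * w[j] for j in active}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[j] * xi[j] for j in active) + b)
            if margin < 1.0:  # this sample violates the margin
                for j in active:
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        for j in active:
            w[j] -= lr * grad_w[j]
        b -= lr * grad_b
    return w

def svm_rfe(X, y):
    """Rank feature indices from most to least important by recursive elimination."""
    active = list(range(len(X[0])))
    eliminated = []
    while active:
        w = train_linear_svm(X, y, active)
        worst = min(active, key=lambda j: abs(w[j]))  # least important feature
        active.remove(worst)
        eliminated.append(worst)
    return eliminated[::-1]  # the feature eliminated last is ranked first


# Toy data: feature 0 separates the classes, features 1 and 2 are small noise.
X = [[2.0, 0.1, -0.1], [1.5, -0.1, 0.1], [-2.0, 0.1, 0.1], [-1.5, -0.1, -0.1]]
y = [1, 1, -1, -1]
print(svm_rfe(X, y))
```

Removing one feature per round is the simplest variant; with hundreds of features one might remove several per round to save time, at some cost in ranking resolution.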

However, in general SVM-RFE is not stable under perturbation of the samples. A small change
in the samples may result in a very different feature ranking. There are two possible reasons. One is that
highly correlated variables are too sensitive and may be ranked in different orders by different
classifiers. Another is that, due to randomness, some subsets of samples might be singular in the
sense that they are less representative of the whole data structure. As a result the SVM classifiers
over-fit and the feature ranking by SVM-RFE is therefore unreliable. The first situation is
less harmful for classification performance while the second is critical. To overcome this phenomenon
and guarantee the stability of the ranking, we use a pseudo-aggregation technique. We randomly
choose a subset of training samples and run SVM-RFE to select the top important features. This
process is repeated tens or hundreds of times, and only those features that appear important very
frequently are deemed truly important ones. This removes the randomness and results in a
much more reliable ranking.
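
The aggregation step can be sketched independently of the ranking method. In the code below, `rank_fn` is any function that returns feature indices ordered from most to least important; the simple `rank_by_mean_gap` ranker is only a stand-in so that the example is self-contained, and in practice an SVM-RFE ranker would be passed instead. Drawing the random subsets separately within each class, the default fraction of 0.75 and the fixed seed are assumptions made for this illustration.

```python
import random
from collections import Counter

def aggregate_rankings(X, y, rank_fn, top_k, n_repeats=50, fraction=0.75, seed=0):
    """Count how often each feature appears among the top_k features of rank_fn
    over random subsets of the training samples (drawn per class).

    Returns (feature, count) pairs sorted by decreasing count.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    counts = Counter()
    for _ in range(n_repeats):
        subset = []
        for indices in by_class.values():
            k = max(1, round(fraction * len(indices)))
            subset.extend(rng.sample(indices, k))
        ranking = rank_fn([X[i] for i in subset], [y[i] for i in subset])
        counts.update(ranking[:top_k])
    return counts.most_common()

def rank_by_mean_gap(X, y):
    """Stand-in ranker: order features by the gap between the two class means."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    gaps = []
    for j in range(len(X[0])):
        mean_pos = sum(x[j] for x in pos) / len(pos)
        mean_neg = sum(x[j] for x in neg) / len(neg)
        gaps.append((abs(mean_pos - mean_neg), j))
    return [j for _, j in sorted(gaps, reverse=True)]
```

Features that are selected in nearly every repetition are the stable ones; features that appear only occasionally are likely artifacts of a particular subset.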

With this ranking of features, we can conclude which statistics are useful for quantifying the
writing style. We use cross validation to select the number of features included in the final classification
model. This group of features is a stable and most discriminative subset of features. A final
classifier is built to classify the test data.

2.4. Data analysis. The classifier we have built is used to analyze the authorship question. We
first examine the discriminative power of the classifier on the training data. If it cannot even reliably
classify the training data we can convincingly reject the two-author hypothesis. Even if it can, the
decisive test is whether it can classify the test data, or detect a chrono-divide in it. If it fails
then again we should reject the two-author hypothesis. On the other hand, if the classifier
classifies the training data, and it can also classify the test data accurately or detect a clear
chrono-divide, we can then convincingly conclude that the book does contain two different writing styles
and can therefore be confidently attributed to two different authors. Moreover, the feature subset
and the classifier describe the difference of the two authors' writing styles.


2.5.  The  algorithm.  In  the  following  we  summarize  the  process  of  our  algorithm: 

(1) Initialize the data (the book), which contains parts A and B suspected to be written by
two different authors.

(2) Split part A and part B into many sections and extract the features for each section as
described in Section 2.1. This forms the whole data set D, containing D_A and D_B.

(3) Choose a portion (e.g. 20%-30%) of D_A and D_B respectively to form the test data set and
leave the remainder as the training data set. The test data will not be used until the final
model is built.

(4) Randomly choose a subset of the training data as modeling data and use the rest (again
20%-30%) as the validation data. Run SVM-RFE on the modeling data and use the
validation data to determine all the parameters used. This provides a ranking of all the
n + m + 4 features extracted in step 2.

(5) For d ranging from 1 to n + m + 4, build a classifier using only the top d features and evaluate
its performance on the validation data. The best model is the one with minimal validation
error and, among those, the minimal number of top features. The feature subset of this best
model is recorded.

(6) Repeat steps 4 and 5 T times to obtain T best models and T subsets of corresponding
important features. We recommend T to be larger than 50. Rank all the features in these
subsets according to their appearance frequency. Denote by N the total number of features
included.

(7) For d = 1, ..., N, use cross validation to select the number of features that should be
included in the final classifier. Denote it by d*. Note we require both the cross validation
error and the number of features to be as small as possible.

(8) Retrain the model using the whole training set based on these top d* important features.

(9) Use the classifier to classify the test data. Draw the conclusion according to the
performance.

Since our ranking process involves the aggregation of a large number of models that are trained using
SVM-RFE on different subsets of the same data source, we refer to our approach as the
pseudo-aggregation SVM-RFE method.
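The counting part of step (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_by_frequency` and the toy feature labels are hypothetical, and the T feature subsets are assumed to have already been produced by the SVM-RFE runs of steps (4) and (5).

```python
from collections import Counter


def rank_by_frequency(feature_subsets):
    """Rank features by how many of the T best models selected them.

    feature_subsets: list of T iterables of feature labels, one per repeat
    of steps (4)-(5). Returns (feature, count) pairs, most frequent first;
    ties are broken by label so the result is reproducible.
    """
    counts = Counter()
    for subset in feature_subsets:
        counts.update(set(subset))  # count each feature once per model
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy input: three hypothetical repeats (T = 3) with made-up feature labels.
subsets = [["f1", "f7", "f9"], ["f7", "f9"], ["f2", "f7"]]
ranking = rank_by_frequency(subsets)
N = len(ranking)  # total number of distinct features, as in step (6)
```

The ranked list is what step (7) would then truncate to its top d features for each candidate d.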

3.  Analysis  Of  Dream  of  the  Red  Chamber 

Having  established  a  rigorous  protocol  for  the  study  of  authorship  of  a  body  of  texts,  we  apply 
this  protocol  to  investigate  the  authorship  controversy  of  the  Cheng-Gao  version  of  Dream  of  the 
Red  Chamber.  In  particular  we  investigate  the  existence  of  a  chrono-divide  at  Chapter  80. 

The  book  is  first  divided  into  samples.  To  balance  the  number  of  samples,  we  generate  one 
sample  for  each  of  the  first  80  chapters  while  using  the  conventional  practice  of  duplicating  each 
of  the  last  40  chapters  into  two  chapters  to  obtain  80  samples.  From  those  samples  we  extract  the 
features  by  calculating  the  statistics  proposed  in  subsection  2.1.  These  features  are  then  normalized 
for  fair  comparison.  In  total  we  have  196  variables.  They  are  the  144  characters  and  48  words, 
the  normalized  mean  and  variation  of  sentence  length,  and  the  frequencies  of  direct  speeches  and 
exclamations. 

To investigate the authorship controversy we perform three separate tests. First we build a
classifier for the whole book and look for the existence of a chrono-divide at Chapter 80. For added
robustness and reliability we also perform the same test separately on the first 80 chapters alone
and on the last 40 chapters alone.

3.1.  Separability  of  the  chapters  by  Cao  and  Gao.  In  the  first  experiment  we  apply  our 
method to the whole Cheng-Gao version of Dream of the Red Chamber. Samples from the first 60
chapters  are  designated  as  training  samples  for  one  class  while  samples  from  the  last  30  chapters 
are  designated  as  training  samples  for  another  class.  The  remaining  samples,  from  Chapter  61  to 
90,  are  held  out  as  test  samples.  The  training  samples  are  further  randomly  split  into  modeling 
data  of  80  samples  and  validation  data  of  40  samples.  The  SVM-RFE  is  repeated  100  times  and 
d*  is  chosen  using  50  cross  validation  runs.  We  have  the  following  observations. 

Instability of SVM-RFE. The randomness of the modeling set results in very substantial
fluctuations in the number of features selected as well as in the feature rankings. The resulting
classifiers may also perform quite differently. Table 1 lists the features selected using two
different modeling data sets. One selects 11 features and the other selects only 4, with only one
feature in common. The classifiers also perform differently. The experiments clearly establish
the instability of SVM-RFE.

Modeling set   Features selected                                        Validation error
1              11 features (Chinese characters and words; illegible)    5/40
2              4 features (Chinese characters and words; illegible)     1/40

Table 1. The features and validation errors of the classifiers obtained from two
randomly selected modeling subsets.

Given  such  instability  one  cannot  reliably  draw  any  conclusions  from  any  single  run.  For 
example,  if  a  modeling  data  set  separates  the  training  data  well  it  might  be  due  to  over-fitting. 
Conversely  if  it  separates  poorly  it  might  be  due  to  under-fitting.  This  problem  is  overcome 
with  our  Pseudo  Aggregate  SVM-RFE  method. 


Stability of Pseudo Aggregate SVM-RFE. Our pseudo aggregate SVM-RFE approach
repeats SVM-RFE 100 times using randomized data sets. The data set from each repeat is used
to select a set of features, from which a classifier is built. For simplicity we shall refer to
the data set, the features and the resulting classifier from a repeat together as a model. To counter
random fluctuations we consider important features to be those that appear frequently among
the 100 classifiers. This reduces the instability caused by randomness. Our reasoning is as
follows: if the two classes are well separated, there should exist a set of features that helps to
build a good classifier. Most modeling subsets should be able to select these features, and
only a limited number of modeling sets might be singular and miss them. Conversely, if the
two classes cannot be well separated, no consistently discriminative features exist. Different
modeling sets may lead to totally different feature subsets. As a result, no feature appears with
high frequency across the 100 models. This reasoning, however, is only partially true. When
the two classes cannot be separated, the modeling process can sometimes overfit the data by
selecting many variables, which results in high absolute frequencies for some less important or
irrelevant features. Such a phenomenon is usually accompanied by a large number of variables
and low validation accuracy. To improve the process we propose a more appropriate metric,
which we call relative frequency. In relative frequency we weight the frequency by two criteria.
Under the first criterion a variable appearing in short models is weighted more than a variable
appearing in long models. This leads to a weight of $h(n_j)$ for a variable in the $j$-th model, with
$n_j$ being the number of variables in the $j$-th model. Under the second criterion a variable in a model
with high predictive accuracy is weighted more than a variable in a model with poor predictive accuracy.
This provides another weight $g(A_j)$ for a variable in the $j$-th model, where $A_j$ denotes the
accuracy of the $j$-th model computed from the validation process. Mathematically the relative
frequency for a variable $x_i$ in a test run of $M$ repeats is defined as

$$ rf(x_i) = \frac{1}{M} \sum_{j=1}^{M} g(A_j)\, h(n_j)\, \mathbf{1}(x_i \text{ appears in model } j). \tag{3.1} $$


In our study we always set $M = 100$. Also, we set $g(A_j)$ to an exponential function of
$[2A_j - 1]_+$, where $[t]_+ = \max\{0, t\}$, and $h(n_j) = [1 - c n_j]_+$ for some constant $c$.
For $g(A_j)$ the idea is that the weight should decay fast if the accuracy is close to 50% or less,
because this indicates that the classifier is simply not effective. For $h(n_j)$ we put in a penalty
for the number of variables used in a model. In our experiments we have chosen $c = 1/30$, which
seems to work well.
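As a rough illustration of how the relative frequency could be computed, the following Python sketch evaluates rf(x_i) for a list of models. The weight h follows the formula in the text. The accuracy weight `g_stand_in` is a hypothetical placeholder, not the exponential form used in the study; it only preserves the stated property that models at or below 50% accuracy receive no weight. All names and the toy data are assumptions made for this example.

```python
def positive_part(t):
    """[t]_+ = max{0, t}."""
    return max(0.0, t)


def h(n_j, c=1.0 / 30.0):
    """Model-size weight h(n_j) = [1 - c * n_j]_+ from the text."""
    return positive_part(1.0 - c * n_j)


def g_stand_in(a_j):
    """Hypothetical stand-in for g(A_j): zero at or below 50% accuracy.

    This is NOT the exponential weight of the study; it is a placeholder
    with the same qualitative behaviour near chance level.
    """
    return positive_part(2.0 * a_j - 1.0)


def relative_frequency(feature, models, g=g_stand_in, c=1.0 / 30.0):
    """rf(x_i) = (1/M) * sum_j g(A_j) h(n_j) 1(x_i appears in model j).

    models: list of (selected_features, validation_accuracy) pairs.
    """
    total = 0.0
    for selected, accuracy in models:
        if feature in selected:
            total += g(accuracy) * h(len(selected), c)
    return total / len(models)


# Toy run with M = 2 made-up models.
models = [({"x1", "x2", "x3"}, 0.9), ({"x1"}, 0.5)]
rf_x1 = relative_frequency("x1", models)
rf_x2 = relative_frequency("x2", models)
```

In this toy run the second model contributes nothing to `rf_x1` because its accuracy is at chance level, so appearing in it does not raise the score of `x1` above that of `x2`.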


Our experiments show that the feature rankings obtained from relative frequency are very stable
and consistent. We have performed runs of 100 repeats using different random seeds in
MATLAB, and the results are always similar. An additional benefit of using relative frequency
instead of absolute frequency is that the existence of an effective classifier is typically
accompanied by high relative frequencies for the top features, while low relative frequencies for the
top features usually imply poor separability. Hence we can use relative frequency as a simple
guide to the separability of the samples. We will show some examples in the next section.


Results and conclusion. In Experiment 1 we have performed a run of 100 repeats on the
entire Cheng-Gao version of Dream of the Red Chamber. Altogether 70 features have appeared
in at least one model. However, only a small number of them have appeared with
high enough frequency to be viewed as important. We apply cross validation to select
the number of features, and the mean cross validation error rate against the number of
features is plotted in Figure 1 (a). The figure tells us that 10 to 50 features are enough to
tell the style difference between the two parts. Using fewer characters and words is insufficient,
while using more also degrades the performance by bringing in too much noise. The small
cross validation error rate is encouraging, and it already hints at a strong possibility that
the two training sample sets have stylistic differences significant enough to support the two-author
hypothesis.

To settle the two-author hypothesis more definitively we apply our classifier to the test
data, which until now has never been used during the feature selection and classifier modeling
process. In particular we investigate the existence of a chrono-divide in the values obtained
from the classifier. Figure 1 (b), which plots these values, clearly shows a chrono-divide at
Chapter 80: for Chapters 81-90 the classifier yields all negative values, while for Chapters 61-80
the classifier yields all positive values with the exception of Chapter 67. Allowing for some
statistical aberrations, our results provide extremely convincing, if not irrefutable,
evidence that there exist clear stylometric differences between the writing of the first 80
chapters and that of the last 40 chapters. This difference strongly supports the two-author hypothesis
for Dream of the Red Chamber. We also note that our investigation did not need to assume
any knowledge that the stylistic change should be at Chapter 80. The fact that the
chrono-divide we have detected is indeed at Chapter 80 lends even stronger support to the
two-author hypothesis.
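The chrono-divide above is read off a plot. One simple way to quantify the same idea, offered here only as a hedged sketch and not as the procedure used in the study, is to search for the split point that best separates positive classifier values from negative ones. The function name `best_divide` and the numerical values below are made up for illustration.

```python
def best_divide(values):
    """Return (k, errors): the split index k with the fewest sign violations.

    values: classifier outputs in chapter order. Entries before index k are
    expected to be positive and entries from k onward negative; `errors`
    counts the entries that violate this pattern.
    """
    best_k, best_err = 0, None
    for k in range(len(values) + 1):
        err = sum(v <= 0 for v in values[:k]) + sum(v > 0 for v in values[k:])
        if best_err is None or err < best_err:
            best_k, best_err = k, err
    return best_k, best_err


# Made-up values for 30 test chapters (61-90): positive up to Chapter 80
# except for one outlier, negative afterwards. These are NOT the study's data.
values = [0.8] * 6 + [-0.3] + [0.7] * 13 + [-0.9] * 10
k, errors = best_divide(values)
first_chapter_after_divide = 61 + k
```

A sharp divide shows up as a split with very few violations; values without any temporal order would give a large error count for every split.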


Figure 1. Experiment 1: (a) Mean cross validation error rate; (b) Values of SVM
classifier on chapters 61-90.

Interestingly, the fact that Chapter 67 appeared as an “outlier” in our classification serves as
further evidence for the validity of our analysis. It was only after the tests that we realized that the
authorship of Chapter 67 itself is one of the controversies in Redology. Unlike the main controversy
about the authorship of the first 80 chapters and the last 40 chapters, experts are less unified in
their positions here. Again, our results strongly suggest that Chapter 67 is stylistically different
from the rest of the first 80 chapters, and it may not have been written by Cao. Our finding is consistent
with the conclusion of [5].

3.2. Non-separability of the first 80 chapters. To further validate our method we apply the
same tests to the first 80 chapters of Dream of the Red Chamber to see whether we can get a
chrono-divide (Experiment 2). We use the first 30 and last 30 chapters as the training data and
leave chapters 31-50 as the test data. Figure 2 shows the mean cross validation error and the values
of the SVM classifier on the test data, chapters 31-50. The experiment shows that many more features have
been selected in the 100 repeats, implying the difficulty of finding a consistent subset of discriminative
features. The large errors on the training data also indicate the difficulty of separation. When
the classifier is applied to the test data, there is clearly no chrono-divide. This suggests that our
method yields a conclusion that is completely consistent with what is known.


Figure 2. Experiment 2: (a) Mean cross validation error rate; (b) Values of SVM
classifier on chapters 31-50. Note there is no chrono-divide.

3.3. Analysis of chapters 81-120: style change over time. We next apply our method to the
last 40 chapters (Experiment 3). Our first experiment has already confirmed that they are unlikely
to have been written by Cao. However, there are still debates on whether these were written entirely by
one author (most likely Gao himself), or by more than one author. Our mathematical analysis may
offer some insight here.

We  split  the  40  chapters  into  two  subsets  as  before.  The  training  data  include  Chapters  81-95  as 
one  class  and  Chapters  106-120  as  another.  The  test  data  are  the  middle  10  chapters.  Because  of 
the  relatively  small  number  of  samples  we  have  subdivided  each  chapter  into  2  sections  to  increase 
the  sample  size.  As  a  result  we  now  have  60  samples  in  the  training  data  and  20  in  test  data,  with 
2  samples  corresponding  to  one  chapter.  The  mean  cross  validation  error  of  the  final  classifier  and 
its  classification  values  on  the  test  samples  are  shown  in  Figures  3  (a)  and  (b)  respectively. 

In this experiment we observe that the performance in terms of both the classifier and feature
ranking is noticeably worse than that in Experiment 1 but substantially better than that in
Experiment 2. Furthermore, unlike the results from the first two experiments, the values from the
classifier show an interesting trend. Compared with Figure 2 (b), where the values appeared to lack
any order, the values here exhibit a clear gradual downward shift. On the other hand, compared to
Figure 1 (b), the values plotted in Figure 3 (b) do not show a clear sharp chrono-divide, even though the
values change gradually from being positive to being negative. What this tells us is that the writing
style of the last 40 chapters has undergone a gradual change, but this change is unlikely to be
due to a change of authorship.

Our results here could be subject to several interpretations. One plausible interpretation is that
Gao might indeed have obtained some incomplete set of manuscripts by Cao, and tried to complete the
novel based on what he had obtained. The style change is a result of the lack of genuine work
by Cao as the story developed. A more plausible interpretation is that the last 40 chapters were
written by someone such as Gao trying to imitate Cao’s style, and over time the author became
sloppier and returned more and more to his own style.

Figure 3. Experiment 3: (a) Mean cross validation error rate; (b) Values of SVM
classifier on chapters 96-105, which correspond to the samples 31-50 in all 80 samples.
Note two samples come from one chapter in this experiment.


4.  Analysis  of  the  other  three  Great  Classical  Novels 

To further bolster the credibility of our approach we test our method on the other three Great
Classical Novels in Chinese literature, Romance of the Three Kingdoms (三国演义), Water Margin
(水浒传), and Journey to the West (西游记). Unlike Dream of the Red Chamber, there is no
authorship controversy for these other three novels. Thus if our method is indeed robust we should
expect negative answers to the two-author hypothesis for all of them, finding no chrono-divides.

As with Dream of the Red Chamber, we split each of the three novels into training samples and
test samples. Both Romance of the Three Kingdoms and Water Margin have 120 chapters. In both
cases we designate the first 30 chapters and the last 30 chapters as the two classes of training data,
and the middle 60 chapters as test data. For Journey to the West the two classes of training data
are the first and last 25 chapters respectively, with the middle 50 chapters as test data.

We use the same procedure to test for chrono-divides on the three novels. Compared to Dream
of the Red Chamber, the selected features show much lower relative frequencies, indicating difficulty
in differentiating between the writing styles. Table 2 shows the relative frequencies (with c = 1/30)
of the top 8 features for each of the four Great Classical Novels. Also of note is that in the case of
Water Margin, 51 features are used to build a classifier from the 60 training samples, which is clearly
another strong indication of the difficulty.

Figure 4. Classification results from the test samples of the other three classical
novels: (a) Romance of the Three Kingdoms; (b) Water Margin; (c) Journey to the
West.


Novel                            Relative frequencies of top 8 features
Dream of the Red Chamber         0.57  0.46  0.43  0.36  0.31  0.30  0.29  0.19
Romance of the Three Kingdoms    0.31  0.27  0.26  0.25  0.23  0.22  0.17  0.15
Water Margin                     0.18  0.17  0.16  0.16  0.14  0.11  0.11  0.10
Journey to the West              0.03  0.03  0.02  0.02  0.02  0.02  0.02  0.02

Table 2. Relative frequencies of the top ranked 8 features in each of the four Great
Classical Novels.


Figure  4  plots  the  values  from  the  classifiers  for  all  three  novels.  In  all  cases  the  values  fluctuate 
in  such  a  way  that  it  is  quite  clear  that  no  chrono-divides  exist,  as  expected. 

This  analysis  shows  that  our  approach  can  reliably  reject  the  two-author  hypothesis  when  it  is 
false,  lending  further  support  to  the  effectiveness  and  robustness  of  our  method. 

5.  Conclusions 

Inspired by the authorship controversy of Dream of the Red Chamber and the application of SVM
in the study of literary stylometry, we have developed a mathematically rigorous new method for
the analysis of authorship by testing for a chrono-divide in writing styles. We have shown that the
method is highly effective and robust. Applying our method to the Cheng-Gao version of Dream
of the Red Chamber has led to convincing, if not irrefutable, evidence that the first 80 chapters and
the last 40 chapters of the book were written by two different authors. Furthermore, our analysis
has unexpectedly provided strong support for the hypothesis that Chapter 67 was not the work of
Cao Xueqin either.

The methodology in this paper is rather effective in selecting the most important features for
classification through a new ranking system based on relative frequency. Future experiments
should apply this methodology to a wider range of works.

It is worth mentioning that there have been several other attempts to complete Dream of the Red
Chamber from its first 80 chapters, among them Continued Dream of the Red Chamber (续红楼梦)
by Qin Zichen (秦子忱). Using the same features used for building the classifier in Experiment
1, we can compute the Euclidean distances between all chapters of Dream of the Red Chamber and
their distances from the chapters of Continued Dream of the Red Chamber; see Figure 5. Surprisingly,
although these features were selected to emphasize the differences between Cao and Cheng-Gao, they
lead to an even larger distance between the first 80 chapters and the chapters of Continued Dream of
the Red Chamber. This implies that the style of the 40 chapters by Cheng-Gao is more similar to that
of the 80 chapters by Cao than the style of Continued Dream of the Red Chamber is. This may be why
the Cheng-Gao version is more popular than other versions.

REVISIONIST INTEGRAL DEFERRED CORRECTION
WITH ADAPTIVE STEP-SIZE CONTROL

Andrew J. Christlieb, Colin B. Macdonald,
Benjamin W. Ong and Raymond J. Spiteri

Adaptive  step-size  control  is  a  critical  feature  for  the  robust  and  efficient  numerical 
solution  of  initial-value  problems  in  ordinary  differential  equations.  In  this  paper, 
we  show  that  adaptive  step-size  control  can  be  incorporated  within  a  family  of 
parallel  time  integrators  known  as  revisionist  integral  deferred  correction  (RIDC) 
methods. The RIDC framework allows for various strategies to implement step-size
control, and we report results from exploring a few of them.

Keywords: initial-value problems, revisionist integral deferred correction, parallel time integrators,
local error estimation, adaptive step-size control.

1. Introduction

The purpose of this paper is to show that local error estimation and adaptive step-size
control can be incorporated in an effective manner within a family of parallel
time integrators based on revisionist integral deferred correction (RIDC). RIDC
methods, introduced in [10], are “parallel-across-the-step” integrators that can be
efficiently implemented with multicore [10; 6], multi-GPGPU [4], and multinode
[9] architectures. The “revisionist” terminology was adopted to highlight that (1)
RIDC is a revision of the standard integral defect correction (IDC) formulation [12],
and (2) successive corrections, running in parallel but (slightly) lagging in time,
revise and improve the approximation to the solution.

RIDC methods have been shown to be effective parallel time-integration methods.
They can typically produce a high-order solution in essentially the same amount
of wall-clock time as the constituent lower-order methods. In general, for a given
amount of wall-clock time, RIDC methods are able to produce a more accurate
solution than conventional methods. These results have thus far been demonstrated
with constant time steps. It has long been accepted that local error estimation
and adaptive step-size control form a critical part of a robust and efficient strategy
for solving initial-value problems in ordinary differential equations (ODEs), in
particular problems with multiple timescales; see [15], for example. Accordingly, in
order to assess the practical viability of RIDC methods, it is important to establish
whether they can operate effectively with variable step sizes. It turns out that
there are subtleties associated with modifying the RIDC framework to incorporate
functionality for local error estimation and adaptive step-size control: there are a
number of different implementation options, and some of them are more effective
than others.

The  remainder  of  this  paper  is  organized  as  follows.  In  Section  2,  we  review 
the  ideas  behind  RIDC  as  well  as  strategies  for  local  error  estimation  and  step-size 
control. We then combine these ideas to propose various strategies for RIDC methods
with error and step-size control. In Section 3, we describe the implementation
of  these  strategies  within  the  RIDC  framework  and  suggest  avenues  that  can  be 
explored  for  a  production-level  code.  In  Section  4,  we  demonstrate  that  the  use  of 
local  error  estimation  and  adaptive  step-size  control  inside  RIDC  is  computationally 
advantageous.  Finally,  in  Section  5,  we  summarize  the  conclusions  reached  from 
this  investigation  and  comment  on  some  potential  directions  for  future  research. 


2.  Review  of  relevant  background 

We are interested in numerical solutions to initial-value problems (IVPs) of the
form

$$ \begin{cases} y'(t) = f(t, y), & t \in [a, b], \\ y(a) = y_a, \end{cases} \tag{1} $$

where $y(t) : \mathbb{R} \to \mathbb{R}^m$, $y_a \in \mathbb{R}^m$, and $f : \mathbb{R} \times \mathbb{R}^m \to \mathbb{R}^m$. We first review RIDC
methods, a family of parallel time integrators that can be applied to solve (1). Then,
we review strategies for local error estimation and adaptive step-size control for
IVP solvers.


2.1.  RIDC.  RIDC  methods  [10;  6;  4]  are  a  class  of  time  integrators  based  on 
integral  deferred  correction  [12]  that  can  be  implemented  in  parallel  via  pipelining. 
RIDC  methods  first  compute  an  initial  (or  provisional)  solution,  typically  using  a 
standard  low-order  scheme,  followed  by  one  or  more  corrections.  Each  correction 
revises  the  current  solution  and  increases  its  formal  order  of  accuracy.  After  initial 
startup  costs,  the  predictor  and  all  the  correctors  can  be  executed  in  parallel.  It  has 
been  shown  that  parallel  RIDC  methods  with  uniform  step-sizes  give  almost  perfect 
parallel  speedups  [10].  In  this  section,  we  review  RIDC  algorithms,  generalizing 
the  overall  framework  slightly  to  allow  for  nonuniform  step-sizes  on  the  different 
correction  levels. 

We denote the nodes for correction level $\ell$ by

$$ a = t_0^{[\ell]} < t_1^{[\ell]} < \cdots < t_{N^{[\ell]}}^{[\ell]} = b, $$

where $N^{[\ell]}$ denotes the number of time steps on level $\ell$. In practice, the nodes on
each level are obtained dynamically by the step-size controller.

2.1.1. The predictor. To generate a provisional solution, a low-order integrator is
applied to solve the IVP (1). For example, a first-order forward Euler integrator
applied to (1) gives

$$ \eta_n^{[0]} = \eta_{n-1}^{[0]} + \bigl(t_n^{[0]} - t_{n-1}^{[0]}\bigr)\, f\bigl(t_{n-1}^{[0]}, \eta_{n-1}^{[0]}\bigr), \tag{2} $$

for $n = 1, 2, \ldots, N^{[0]}$, with $\eta_0^{[0]} = y_a$, and where we have indexed the prediction
level as level 0. We denote $\eta^{[\ell]}(t)$ as a continuous extension [15] of the numerical
solution at level $\ell$, i.e., a piecewise polynomial $\eta^{[\ell]}(t)$ that satisfies

$$ \eta^{[\ell]}\bigl(t_n^{[\ell]}\bigr) = \eta_n^{[\ell]}. $$

The continuous extension of a numerical solution is often of the same order of
accuracy as the underlying discrete solution [15]. Indeed, for the purposes of this
study, we assume $\eta^{[\ell]}(t)$ is of the same order as $\eta_n^{[\ell]}$.
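The forward Euler predictor on a nonuniform set of nodes can be sketched as follows. This is a minimal scalar illustration rather than the RIDC implementation: the function name `forward_euler`, the test problem and the hand-picked grid are assumptions made for this example.

```python
import math


def forward_euler(f, nodes, y_a):
    """Provisional solution eta_n on the given level-0 nodes t_n.

    Implements eta_n = eta_{n-1} + (t_n - t_{n-1}) * f(t_{n-1}, eta_{n-1})
    for a scalar IVP; `nodes` need not be equally spaced.
    """
    eta = [y_a]
    for n in range(1, len(nodes)):
        dt = nodes[n] - nodes[n - 1]
        eta.append(eta[-1] + dt * f(nodes[n - 1], eta[-1]))
    return eta


# Test problem y' = -y, y(0) = 1 on a hand-picked nonuniform grid.
nodes = [0.0, 0.1, 0.3, 0.6, 1.0]
eta0 = forward_euler(lambda t, y: -y, nodes, 1.0)
error_at_b = abs(eta0[-1] - math.exp(-1.0))
```

In an adaptive code the list `nodes` would not be known in advance; each new node would be chosen by the step-size controller as the integration proceeds.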

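2.1.2. The correctors. Suppose an approximate solution $\eta(t)$ to IVP (1) is computed.
Denote the exact solution by $y(t)$. Then, the error of the approximate
solution is $e(t) = y(t) - \eta(t)$. If we define the defect as $\delta(t) = f(t, \eta(t)) - \eta'(t)$,
then

$$ e'(t) = y'(t) - \eta'(t) = f(t, \eta(t) + e(t)) - f(t, \eta(t)) + \delta(t). $$

The error equation can be written in the form

$$ \Bigl[ e(t) - \int_a^t \delta(\tau)\, d\tau \Bigr]' = f(t, \eta(t) + e(t)) - f(t, \eta(t)), \tag{3} $$

subject to the initial condition $e(a) = 0$. In RIDC, the corrector at level $\ell$ solves
for the error $e^{[\ell-1]}(t)$ of the solution $\eta^{[\ell-1]}(t)$ at the previous level to generate the
corrected solution $\eta^{[\ell]}(t)$,

$$ \eta^{[\ell]}(t) = \eta^{[\ell-1]}(t) + e^{[\ell-1]}(t). $$

For example, a corrector at level $\ell$ that corrects $\eta^{[\ell-1]}(t)$ by applying a first-order
forward Euler integrator to the error equation (3) takes the form

$$ e^{[\ell-1]}\bigl(t_n^{[\ell]}\bigr) - e^{[\ell-1]}\bigl(t_{n-1}^{[\ell]}\bigr) - \int_{t_{n-1}^{[\ell]}}^{t_n^{[\ell]}} \delta^{[\ell-1]}(\tau)\, d\tau = \Delta t_n^{[\ell]} \Bigl[ f\bigl(t_{n-1}^{[\ell]}, \eta^{[\ell-1]}(t_{n-1}^{[\ell]}) + e^{[\ell-1]}(t_{n-1}^{[\ell]})\bigr) - f\bigl(t_{n-1}^{[\ell]}, \eta^{[\ell-1]}(t_{n-1}^{[\ell]})\bigr) \Bigr], $$

where $\Delta t_n^{[\ell]} = t_n^{[\ell]} - t_{n-1}^{[\ell]}$. After some algebraic manipulation, one obtains

$$ \eta_n^{[\ell]} = \eta_{n-1}^{[\ell]} + \Delta t_n^{[\ell]} \Bigl[ f\bigl(t_{n-1}^{[\ell]}, \eta_{n-1}^{[\ell]}\bigr) - f\bigl(t_{n-1}^{[\ell]}, \eta^{[\ell-1]}(t_{n-1}^{[\ell]})\bigr) \Bigr] + \int_{t_{n-1}^{[\ell]}}^{t_n^{[\ell]}} f\bigl(\tau, \eta^{[\ell-1]}(\tau)\bigr)\, d\tau. \tag{4} $$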
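
The integral in (4) is approximated using quadrature,

$$ \int_{t_{n-1}^{[\ell]}}^{t_n^{[\ell]}} f\bigl(\tau, \eta^{[\ell-1]}(\tau)\bigr)\, d\tau \approx \sum_{i=1}^{|\mathcal{T}_n^{[\ell]}|} \alpha_{n,i}^{[\ell-1]}\, f\bigl(\tau_i, \eta^{[\ell-1]}(\tau_i)\bigr), \quad \tau_i \in \mathcal{T}_n^{[\ell]}, \tag{5} $$

where the set of quadrature nodes, $\mathcal{T}_n^{[\ell]}$, for a first-order corrector satisfies

1. $|\mathcal{T}_n^{[\ell]}| = \ell + 1$,

2. $\mathcal{T}_n^{[\ell]} \subseteq \{t_n^{[\ell-1]}\}_{n=0}^{N^{[\ell-1]}}$,

3. $\min(\mathcal{T}_n^{[\ell]}) \le t_{n-1}^{[\ell]}$,

4. $\max(\mathcal{T}_n^{[\ell]}) \ge t_n^{[\ell]}$.

The quadrature weights, $\alpha_{n,i}^{[\ell-1]}$, are found by integrating the interpolating Lagrange
polynomials exactly,

$$ \alpha_{n,i}^{[\ell-1]} = \int_{t_{n-1}^{[\ell]}}^{t_n^{[\ell]}} \prod_{j=1,\, j \ne i}^{|\mathcal{T}_n^{[\ell]}|} \frac{\tau - \tau_j}{\tau_i - \tau_j}\, d\tau, \quad \tau_i \in \mathcal{T}_n^{[\ell]}. \tag{6} $$

The term $f\bigl(t_{n-1}^{[\ell]}, \eta^{[\ell-1]}(t_{n-1}^{[\ell]})\bigr)$ in (4) is approximated using Lagrange interpolation,

$$ f\bigl(t_{n-1}^{[\ell]}, \eta^{[\ell-1]}(t_{n-1}^{[\ell]})\bigr) \approx \sum_{i=1}^{|\mathcal{T}_n^{[\ell]}|} \gamma_{n,i}^{[\ell-1]}\, f\bigl(\tau_i, \eta^{[\ell-1]}(\tau_i)\bigr), \quad \tau_i \in \mathcal{T}_n^{[\ell]}, \tag{7} $$

where the same set of nodes, $\mathcal{T}_n^{[\ell]}$, for the quadrature is used for the interpolation.
The interpolation weights are given by

$$ \gamma_{n,i}^{[\ell-1]} = \prod_{j=1,\, j \ne i}^{|\mathcal{T}_n^{[\ell]}|} \frac{t_{n-1}^{[\ell]} - \tau_j}{\tau_i - \tau_j}, \quad \tau_i \in \mathcal{T}_n^{[\ell]}. \tag{8} $$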
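The quadrature weights in (6) and the interpolation weights in (8) can be computed directly from the Lagrange basis polynomials. The sketch below does this in plain Python for a small node set; the helper names and the example nodes are assumptions for illustration, and a production code would likely use a more careful (for example barycentric) formulation.

```python
def lagrange_basis_coeffs(nodes, i):
    """Coefficients (lowest degree first) of the i-th Lagrange basis polynomial."""
    coeffs = [1.0]
    for j, tau_j in enumerate(nodes):
        if j == i:
            continue
        denom = nodes[i] - tau_j
        # Multiply the current polynomial by (tau - tau_j) / denom.
        new = [0.0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            new[k] += -tau_j * c / denom
            new[k + 1] += c / denom
        coeffs = new
    return coeffs


def integrate_poly(coeffs, lo, hi):
    """Exact integral of a polynomial (lowest degree first) over [lo, hi]."""
    def antiderivative(t):
        return sum(c * t ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)


def quadrature_weights(nodes, t_prev, t_next):
    """alpha_i: integral of the i-th basis polynomial over [t_prev, t_next]."""
    return [
        integrate_poly(lagrange_basis_coeffs(nodes, i), t_prev, t_next)
        for i in range(len(nodes))
    ]


def interpolation_weights(nodes, t_prev):
    """gamma_i: the i-th basis polynomial evaluated at t_prev."""
    return [
        sum(c * t_prev ** k for k, c in enumerate(lagrange_basis_coeffs(nodes, i)))
        for i in range(len(nodes))
    ]


# Three previous-level nodes (level l = 2) bracketing the step [0.5, 1.0].
nodes = [0.0, 0.5, 1.0]
alpha = quadrature_weights(nodes, 0.5, 1.0)
gamma = interpolation_weights(nodes, 0.5)
```

Because the basis polynomials sum to one, the quadrature weights sum to the step length and the interpolation weights sum to one, which gives a quick sanity check on any implementation.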


2.2.  Adaptive  step-size  control.  Adaptive  step-size  control  is  typically  used  to 
achieve  a  user-specified  error  tolerance  with  minimal  computational  effort  by  varying 
the  step-sizes  used  by  an  IVP  integrator.  This  is  commonly  done  based  on  a  local 
error  estimate.  It  is  also  generally  desirable  that  the  step-size  vary  smoothly  over 
the  course  of  the  integration.  We  review  common  techniques  for  estimating  the 
local  error,  followed  by  algorithms  for  optimal  step-size  selection. 

2.2.1. Error estimators. Two common approaches for estimating the local truncation
error of a single-step IVP solver are through the use of Richardson extrapolation
(commonly used within a step-size selection framework known as step doubling)
and embedded Runge-Kutta pairs [15]. Step doubling is perhaps the more intuitive
technique. The solution after each step is estimated twice: once as a full step and
once as two half steps. The difference between the two numerical estimates gives
an estimate of the truncation error. For example, denoting the exact solution to
IVP (1) at time $t_n + \Delta t$ as $y(t_n + \Delta t)$, the forward Euler step starting from the exact
solution $y_n = y(t_n)$ at time $t_n$ and using a step of size $\Delta t$ is

$$ \eta_{1,n+1} = y(t_n) + \Delta t\, f(t_n, y_n), $$

and the forward Euler step using two steps of size $\Delta t / 2$ is

$$ \eta_{2,n+1} = \Bigl( y(t_n) + \frac{\Delta t}{2} f(t_n, y_n) \Bigr) + \frac{\Delta t}{2} f\Bigl( t_n + \frac{\Delta t}{2},\; y(t_n) + \frac{\Delta t}{2} f(t_n, y_n) \Bigr). $$

Because forward Euler is a first-order method (and thus has a local truncation error
of $\mathcal{O}(\Delta t^2)$), the two numerical approximations satisfy

$$ y(t_n + \Delta t) = \eta_{1,n+1} + (\Delta t)^2 \phi + \mathcal{O}(\Delta t^3) + \cdots, $$
$$ y(t_n + \Delta t) = \eta_{2,n+1} + 2 \Bigl( \frac{\Delta t}{2} \Bigr)^2 \phi + \mathcal{O}(\Delta t^3) + \cdots, $$

where a Taylor series expansion gives that $\phi$ is a constant proportional to $y''(t_n)$.
The difference between the two numerical approximations gives an estimate for the
local truncation error of $\eta_{2,n+1}$:

$$ e_{n+1} = \eta_{2,n+1} - \eta_{1,n+1} = \frac{\Delta t^2}{2} \phi + \mathcal{O}(\Delta t^3). $$

An  alternative  approach  to  estimating  the  local  truncation  error  is  to  use  embedded 
RK  pairs  [11].  An  s -stage  Runge-Kutta  method  is  a  single-step  method  that  takes 
the  form 

Vn+ 1  =  rjn  +  A  t^biki, 

i= 1 


where 


ki  =  f 


(  ti  +  fjh.  r\n  +  At  ciijkj  J , 
^  j= i  ' 


i  =  1,  2,  . . . ,  s. 


The  idea  is  to  find  two  single-step  RK  methods,  typically  one  with  order  p  and  the 
other  with  order  p  —  1,  that  share  most  (if  not  all)  of  their  stages  but  have  different 
quadrature weights. This is represented compactly in the extended Butcher tableau

\[ \begin{array}{c|c} c & A \\ \hline & b^{T} \\ & b^{*T} \end{array} \]

Denoting the solution from the order-$p$ method as

\[ \eta_{n+1} = \eta_n + \Delta t \sum_{i=1}^{s} b_i k_i, \tag{9a} \]

and the solution from the order-$(p-1)$ method as

\[ \eta^{*}_{n+1} = \eta_n + \Delta t \sum_{i=1}^{s} b^{*}_i k_i, \tag{9b} \]

the error estimate is

\[ e_{n+1} = \eta_{n+1} - \eta^{*}_{n+1} = \Delta t \sum_{i=1}^{s} (b_i - b^{*}_i) k_i, \tag{9c} \]

which is $\mathcal{O}(\Delta t^{p})$.
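As a concrete instance of an embedded pair, the sketch below uses the Heun-Euler pair (second-order Heun combined with first-order forward Euler), which shares both stages between the two methods. It is a minimal illustration under stated assumptions: the function name `heun_euler_step`, the scalar test problem $y' = -y$, and the sign convention (higher-order solution minus lower-order solution) are choices made here, not taken from the authors' code.

```python
def heun_euler_step(f, t, y, dt):
    """One step of the Heun-Euler embedded pair for a scalar ODE.

    Returns (high, err), where high is the second-order (Heun) solution
    and err = high - low is the difference to the first-order (Euler) one.
    """
    k1 = f(t, y)
    k2 = f(t + dt, y + dt * k1)
    high = y + dt * (0.5 * k1 + 0.5 * k2)  # weights (1/2, 1/2): Heun
    low = y + dt * k1                      # weights (1, 0): forward Euler
    return high, high - low


# Hypothetical test problem: y' = -y, y(0) = 1.
f = lambda t, y: -y
dt = 1e-2
high, err = heun_euler_step(f, 0.0, 1.0, dt)
print(high, err)
```

Here the estimate equals $\Delta t\,(k_2 - k_1)/2 = \Delta t^2/2$, which is $\mathcal{O}(\Delta t^{p})$ with $p = 2$, at the cost of no function evaluations beyond those of the Heun step itself.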

A  third  approach  for  approximating  the  local  truncation  error  is  possible  within 
the  deferred  correction  framework.  We  observe  that  in  solving  the  error  equation  (3), 
one  is  in  fact  obtaining  an  approximation  to  the  error.  As  discussed  in  Section  3.3, 
it  can  be  shown  that  the  approximate  error  after  l  first-order  corrections  satisfies 
o(At,,,,+l  +  i ).  We  shall  see  in  Section  3.3  that  this  error  estimate  proves  to  be  a 
poor choice for optimal step-size selection because in our formulation the time step
selection for level $\ell$ does not allow for the refinement of time steps at earlier levels.


2.2.2.  Optimal  step-size  selection.  Given  an  error  estimate  from  Section  2.2.1  for 
a  step  At,  one  would  like  to  either  accept  or  reject  the  step  based  on  the  error 
estimate  and  then  estimate  an  optimal  step-size  for  the  next  time  step  or  retry  the 
current  step.  Following  [16],  Algorithm  1  outlines  optimal  step-size  selection  given 
an  estimate  of  the  local  truncation  error.  In  lines  1-4,  one  computes  a  scaled  error 
estimate.  In  line  5,  an  optimal  time  step  is  computed  by  scaling  the  current  time 
step.  In  lines  6-10,  a  new  time  step  is  suggested;  a  more  conservative  step-size  is 
suggested  if  the  previous  step  was  rejected. 
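The selection procedure just described can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 rather than the authors' code: the function name `select_step` is hypothetical, the sketch accepts a step when the scaled error is at most one (the usual convention for this kind of controller), and the guard for a zero error estimate is an added assumption that is not part of the algorithm.

```python
import math


def select_step(y_n, y_np1, err, dt, p, atol, rtol, prev_rej,
                alpha=0.9, beta=10.0):
    """Return (accept, dt_new) from an error estimate for the step just taken.

    y_n, y_np1, err are sequences of length m; p is the order of the
    integrator; alpha < 1 is the safety factor and beta > 1 bounds the
    allowable change in step-size.
    """
    m = len(y_n)
    total = 0.0
    for i in range(m):
        a = max(abs(y_n[i]), abs(y_np1[i]))  # solution scale for component i
        tau = atol + rtol * a                # mixed absolute/relative tolerance
        total += (err[i] / tau) ** 2
    eps = math.sqrt(total / m)               # scaled error estimate

    # Assumption: if the estimate is exactly zero, grow by the maximum factor.
    if eps > 0.0:
        dt_opt = dt * (1.0 / eps) ** (1.0 / (p + 1))
    else:
        dt_opt = beta * dt

    if prev_rej:
        # Be more conservative after a rejection: never grow the step.
        dt_new = alpha * min(dt, max(dt_opt, dt / beta))
    else:
        dt_new = alpha * min(beta * dt, max(dt_opt, dt / beta))

    return eps <= 1.0, dt_new


ok, dt_next = select_step([1.0], [0.99], [1e-5], 0.01, 1, 1e-4, 1e-4, False)
bad, dt_retry = select_step([1.0], [0.99], [1e-3], 0.01, 1, 1e-4, 1e-4, True)
print(ok, dt_next, bad, dt_retry)
```

In the first call the scaled error is 0.05, so the step is accepted and the next step grows; in the second it is 5, so the step is rejected and a smaller step is proposed for the retry.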


3.  RIDC  with  adaptive  step-size  control 

There  are  numerous  adaptive  step-size  control  strategies  that  can  be  implemented 
within  the  RIDC  framework.  We  consider  three  of  them  in  this  paper  as  well  as 
discuss  other  strategies  that  are  possible. 

3.1.  Adaptive  step-size  control:  prediction  level  only.  One  simple  approach  to 
step-size  control  with  RIDC  is  to  perform  adaptive  step-size  control  on  the  prediction 
level  only,  e.g.,  using  step  doubling  or  embedded  RK  pairs  as  error  estimators  for  the 
step-size  control  strategy.  The  subsequent  correctors  then  use  this  grid  unchanged 
(i.e., without performing further step-size control). With this strategy, corrector $\ell$ is
lagged behind corrector $\ell - 1$ so that each node simultaneously computes an update
on its level (after an initial startup period). This is illustrated graphically in Figure 1.
In principle, near optimal parallel speedup is maintained with this approach provided
the computational overhead for the RIDC method (i.e., the interpolation, quadrature,
and linear combination of solutions) is small compared to the advance of the predictor
from $t_n$ to $t_{n+1}$; in this implementation, a small memory footprint similar to [10]
can be used. Additionally, an interpolation step is circumvented because the nodes
are the same on each level. There are, however, a few potential drawbacks to this
approach. First, it is not clear how to distribute the user-defined tolerance among the
levels. Clearly, satisfying the user-specified tolerance on the prediction level defeats
the purpose of the deferred correction approach. Estimating a reduced tolerance
criterion may be possible a priori, but such an estimate would at present be ad hoc.
Second, there is no reason to expect the corrector (4) should take the same steps to
satisfy an error tolerance when computing a numerical approximation to the error
equation (3).

Input:
    $y_n$: approximate solution at time $t_n$;
    $y_{n+1}$: approximate solution at time $t_{n+1}$;
    $e_{n+1}$: error estimate for $y_{n+1}$;
    $p$: order of integrator;
    $m$: number of ODEs;
    atol, rtol: user specified tolerances;
    prev_rej: flag that indicates whether the previous step was rejected;
    $\alpha < 1$: safety factor;
    $\beta > 1$: allowable change in step-size.

Output:
    accept_flag: flag to accept or reject this step;
    $\Delta t_{new}$: optimal time step.

Set $a(i) = \max\{ |y_n(i)|, |y_{n+1}(i)| \}$, $i = 1, 2, \ldots, m$.
Compute $\tau(i) = \mathrm{atol} + \mathrm{rtol} \cdot a(i)$, $i = 1, 2, \ldots, m$.
Compute $\epsilon = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \bigl( e_{n+1}(i) / \tau(i) \bigr)^2 }$.
Compute $\Delta t_{opt} = \Delta t \, (1/\epsilon)^{1/(p+1)}$.
if prev_rej then
    $\Delta t_{new} = \alpha \min\{ \Delta t, \max\{ \Delta t_{opt}, \Delta t/\beta \} \}$
else
    $\Delta t_{new} = \alpha \min\{ \beta \Delta t, \max\{ \Delta t_{opt}, \Delta t/\beta \} \}$
end
if $\epsilon \le 1$ then
    accept_flag = 1
else
    accept_flag = 0
end

Algorithm 1: Optimal step-size selection algorithm. The approximate solution,
the error estimate, and its order are provided as inputs. For the numerical
experiments in Section 4, we fix $\alpha = 0.9$, $\beta = 10$.

Figure 1. Schematic diagram of step-size control on the prediction level only. The filled
circles denote previously computed and stored solution values at particular times. The
corrections are run in parallel (but lagging in time) and the open circles indicate which
values are being simultaneously computed. The stencil of points required by each level
is shown by the “bubbles” surrounding certain grid points; the thick horizontal shading
indicates the integrals needed in (4).

3.2. Adaptive step-size control: all levels. A generalization of the above formulation
is to utilize adaptive step-size control to solve the error equations (3) as well.
The variant we consider is step doubling on all levels, where each predictor and
corrector performs Algorithm 1; embedded RK pairs can also be used to estimate
the error for step-size adaptivity on all levels. Intuitively, step-size control on every
level gives more opportunity to detect and adapt to error than simply adapting using
the (lowest-order) predictor. For example, this allows the corrector to take a smaller
step  if  necessary  to  satisfy  an  error  tolerance  when  solving  the  error  equation.  Some 
drawbacks  are:  (i)  an  interpolation  step  is  necessary  because  the  nodes  are  generally 
no  longer  in  the  same  locations  on  each  level,  (ii)  more  memory  registers  are 
required,  and  (iii)  there  is  a  potential  loss  of  parallel  efficiency  because  a  corrector 
may  be  stalled  waiting  for  an  adequate  stencil  to  become  available  to  compute  a 
quadrature approximation to the integral in (4). Another issue, both a potential
benefit and a potential drawback, is the number of parameters that can be tuned
for each problem. A discussion on the effect of tolerance choices for each level is
provided in Section 4. One can in practice also tune step-size control parameters
$\alpha$, $\beta$, atol, and rtol for Algorithm 1 separately on each level. Figure 2 highlights
that some nodes might not be able to compute an updated solution on their current
level if an adequate stencil is not available to approximate the integral in (4) using
quadrature. In this example, the level $\ell = 2$ correction is unable to proceed because
it would require interpolated solution values not yet available from level $\ell = 1$,
whereas the prediction level $\ell = 0$ and corrections $\ell = 1$ and $\ell = 3$ are all able to
advance the solution by one step.

[Figure 2 diagram. Rows, bottom to top: prediction ($\ell = 0$), correction ($\ell = 1$), correction ($\ell = 2$), correction ($\ell = 3$); horizontal axis: $t$.]

Figure 2. Schematic diagram of a scenario when step-size control is applied on all levels.
Unlike in Figure 1, here each level has its own grid in time. Solid circles indicate particular
times and levels where the solution is known. In this particular diagram, levels $\ell = 0, 1, 3$
are all able to advance simultaneously to the open circles. However, correction level $\ell = 2$
is unable to advance to the time indicated by the triangle symbol because correction level
$\ell = 1$ has not yet computed far enough. The stencil of points required by each level
is shown by the “bubbles” surrounding certain grid points; the thick horizontal shading
indicates the integrals needed in (4). Note in particular that the dashed stencil includes an
open circle at level $\ell = 1$ that is not yet computed.

3.3.  Adaptive  step-size  control:  using  the  error  equation.  A  third  strategy  one 
might  consider  is  adaptive  step-size  control  for  the  error  equation  (3)  using  the 
solution  to  the  error  equation  itself  as  the  error  estimate.  (One  still  uses  step 
doubling  or  embedded  RK  pairs  to  obtain  an  error  estimate  for  step-size  control 
on  the  predictor  equation  (1).)  At  first  glance,  this  looks  promising  provided  the 
order  of  the  integrator  can  be  established  because  it  is  used  to  determine  an  optimal 
step-size.  One  would  expect  computational  savings  from  utilizing  available  error 
information,  as  opposed  to  estimating  it  via  step  doubling  or  an  embedded  RK  pair. 

If  first-order  predictor  and  first-order  correctors  are  used  to  construct  the  RIDC 
method,  the  analysis  in  [17]  can  be  easily  extended  to  the  proposed  RIDC  methods 
with  adaptive  step-size  control.  We  note  that  the  numerical  quadrature  approximation 
given  in  (5)  and  the  numerical  interpolation  given  in  (7)  are  accurate  to  the  order 
$\mathcal{O}(\Delta t^{\ell+2})$; this is sufficient for the inductive proof in [17] to hold. Hence, one
can show that the method has a formal order of accuracy $\mathcal{O}(\Delta t^{\ell+2})$, where
$\Delta t = \max_{n,\ell} \bigl( t_{n+1}^{[\ell]} - t_n^{[\ell]} \bigr)$.

Although the formal order of accuracy can be established, using the error estimate
from successive levels is a poor choice for optimal step-size selection. Consider
step-size selection for level $\ell$, time step $t_n^{[\ell]}$, using $\eta_{n+1}^{[\ell]} - \eta^{[\ell-1]}\bigl(t_{n+1}^{[\ell]}\bigr)$ as the error
estimator in Algorithm 1. The optimal step-size is chosen to control the local
error estimate via the step-size $\Delta t_n^{[\ell]} = t_{n+1}^{[\ell]} - t_n^{[\ell]}$. However, the local error for the
correctors generally contains contributions from the solutions at all the previous
levels. The validity of the asymptotic local error expansion of the RIDC method in
terms of $\Delta t_n^{[\ell]}$ requires that $\Delta t = \max_{n,\ell} \bigl( t_{n+1}^{[\ell]} - t_n^{[\ell]} \bigr)$ be sufficiently small, and it is
not normally possible to guarantee this in the context of an IVP solver. In other
words,  the  step-size  controller  for  a  corrector  at  a  given  level  cannot  control  the 
entire  local  error,  and  hence  standard  step-control  strategies,  which  are  predicated 
on  the  validity  of  error  expansions  in  terms  of  only  the  step-size  to  be  taken,  cannot 
be  expected  to  perform  well.  We  present  some  numerical  tests  in  Section  4.2.4 
to  illustrate  the  difficulties  with  using  successive  errors  as  the  basis  for  step-size 
control. 

3.4.  Further  discussion.  There  are  many  other  strategies/implementation  choices 
that  affect  the  overall  performance  of  the  adaptive  RIDC  algorithm.  Some  have 
already been discussed in the previous section. We summarize some of the
implementation choices that must be made:

• The choice of how to estimate the error of the discretization must be made. Three
possibilities have already been mentioned: step doubling, embedded RK pairs,
and solutions to the error equation (3). A combination of all three is also possible.

• If an IVP method with adaptive step-size control is used to solve (3), choices
must be made as to how the tolerances and step-size control parameters, $\alpha$ and $\beta$,
are to be chosen for each correction level.

We also list a few implementation details that should be considered when
designing adaptive RIDC schemes.

•  If  adaptive  step-size  control  is  implemented  on  all  levels,  some  correction  levels 
may  sit  idle  because  the  information  required  to  perform  the  quadrature  and 
interpolation  in  (4)  is  not  available.  This  idle  time  adversely  affects  the  parallel 
efficiency of the algorithm. One way to decrease this idle time is to take, instead
of the “optimal step” suggested by the step-size control routine, a smaller step
for which the quadrature and interpolation stencil is
available. There is some flexibility in choosing exactly which points are used
in the quadrature stencil, and it might also be possible to choose a stencil to
minimize the time that correction levels are sitting idle.

• Because values are needed from lower-order correction levels, the storage required
by  a  RIDC  scheme  depends  on  when  values  can  be  overwritten  (see,  e.g.,  the 
stencils  in  Figures  1  and  2).  Thus  to  avoid  increasing  the  storage  requirements, 
the  prediction  level  and  each  correction  level  should  not  be  allowed  to  get  too  far 
ahead  of  higher  correction  levels.  Although  this  is  also  the  case  for  the  nonadaptive 
RIDC  schemes  [10;  6],  if  adaptive  step-size  control  is  implemented  on  all  levels 
(Figure  2),  the  memory  footprint  is  likely  to  increase.  Some  consideration  should 
thus  be  given  to  a  potential  trade-off  between  parallel  efficiency  and  the  overall 
memory  footprint  of  the  scheme. 

• It is important to reduce round-off error when computing the quadrature weights (6)
and the interpolation weights (8). This can be done through careful scaling
and control of the order of the floating-point operations [3].

•  If  one  wishes  to  use  higher-order  correctors  and  predictors  to  construct  RIDC 
integrators,  we  note  that  the  convergence  analysis  in  [7;  8;  5]  only  holds  for 
uniform steps. A nonuniform mesh introduces discrete “roughness” (see [8]);
hence,  an  increase  of  only  one  order  per  correction  level  is  guaranteed  even  though 
a  high-order  method  is  used  to  solve  (3). 

• RIDC methods necessarily incur computational overhead costs, for example,
quadrature evaluation (5), interpolation evaluation (7), and the combination of these
components in (4). Parallel speedup can only be expected if the computational
overhead is small compared to the advance of the predictor from $t_n$ to $t_{n+1}$.

Additionally,  the  RIDC  framework,  by  construction,  solves  a  series  of  error 
equations to generate a successively more accurate solution. This framework can
potentially be exploited to generate order-adaptive RIDC methods. For example,
one  might  control  the  number  of  corrector  levels  adaptively  based  on  an  error 
estimate. 


4.  Numerical  examples 

We focus on the solutions to three nonlinear IVPs. The first is presented in [1]; we
refer to it as the Auzinger IVP:

\[ \begin{aligned} y_1' &= -y_2 + y_1 (1 - y_1^2 - y_2^2), \\ y_2' &= y_1 + 3 y_2 (1 - y_1^2 - y_2^2), \\ y(0) &= (1, 0)^T, \quad t \in [0, 10], \end{aligned} \tag{AUZ} \]

that has the analytic solution $y(t) = (\cos t, \sin t)^T$.

The second is the IVP associated with the Lorenz attractor:

\[ \begin{aligned} y_1' &= \sigma (y_2 - y_1), \\ y_2' &= \rho y_1 - y_2 - y_1 y_3, \\ y_3' &= y_1 y_2 - \beta y_3, \\ y(0) &= (1, 1, 1)^T, \quad t \in [0, 1]. \end{aligned} \tag{LORENZ} \]

For the parameter settings $\sigma = 10$, $\rho = 28$, $\beta = 8/3$, this system is highly sensitive
to perturbations, and an IVP integrator with adaptive step-size control may be
advantageous.

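The third is the restricted three-body problem from [15]; we refer to it as the
Orbit IVP:

\[ \begin{aligned} y_1'' &= y_1 + 2 y_2' - \mu' \frac{y_1 + \mu}{D_1} - \mu \frac{y_1 - \mu'}{D_2}, \\ y_2'' &= y_2 - 2 y_1' - \mu' \frac{y_2}{D_1} - \mu \frac{y_2}{D_2}, \\ D_1 &= \bigl( (y_1 + \mu)^2 + y_2^2 \bigr)^{3/2}, \quad D_2 = \bigl( (y_1 - \mu')^2 + y_2^2 \bigr)^{3/2}, \\ \mu &= 0.012277471, \quad \mu' = 1 - \mu. \end{aligned} \tag{ORBIT} \]

Choosing the initial conditions

\[ y_1(0) = 0.994, \quad y_1'(0) = 0, \quad y_2(0) = 0, \quad y_2'(0) = -2.00158510637908252240537862224, \]

gives a periodic solution with period $t_{end} = 17.065216560159625588917206249$.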
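Before running a solver on these test problems, it can be useful to check a coded right-hand side against a known solution. The sketch below does this for (AUZ), assuming the system reads $y_1' = -y_2 + y_1(1 - y_1^2 - y_2^2)$, $y_2' = y_1 + 3y_2(1 - y_1^2 - y_2^2)$; the function names `auzinger_rhs`, `exact`, and `max_residual` are illustrative and do not come from the paper.

```python
import math


def auzinger_rhs(t, y):
    """Right-hand side of the Auzinger IVP (form assumed as stated above)."""
    y1, y2 = y
    r = 1.0 - y1 * y1 - y2 * y2  # vanishes on the unit circle
    return (-y2 + y1 * r, y1 + 3.0 * y2 * r)


def exact(t):
    """Analytic solution y(t) = (cos t, sin t)."""
    return (math.cos(t), math.sin(t))


def max_residual(n=11):
    """Largest mismatch between f(t, y(t)) and y'(t) on n points in [0, 10]."""
    worst = 0.0
    for k in range(n):
        t = 10.0 * k / (n - 1)
        d1, d2 = auzinger_rhs(t, exact(t))
        # y'(t) = (-sin t, cos t) for the analytic solution
        worst = max(worst, abs(d1 + math.sin(t)), abs(d2 - math.cos(t)))
    return worst


print(max_residual())
```

The residual is at the level of round-off because the nonlinear terms vanish identically on the unit circle, which is what makes this problem convenient for convergence studies.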

We  now  present  numerical  evidence  to  demonstrate  that: 

1. RIDC integrators with nonuniform step-sizes converge and achieve their
designed orders of accuracy.

2.  RIDC  methods  with  adaptive  step-size  constructed  using  step  doubling  (on  the 
prediction  level  only)  and  embedded  RK  error  estimators  (on  the  prediction 
level  only)  converge. 

3.  RIDC  methods  with  adaptive  step-size  control  based  on  step  doubling  to 
estimate  the  local  error  on  the  prediction  and  correction  levels  converge; 
however,  the  step-sizes  selected  are  poor  (many  rejected  steps),  even  for  the 
smooth  Auzinger  problem. 

4. RIDC methods with adaptive step-size control based on step doubling to
estimate the local error on the prediction level but using the solution to the
error equation for step-size control on the correction levels are problematic.

The  numerical  examples  chosen  are  canonical  problems  designed  to  illustrate  the 
step-size adaptivity properties of the RIDC methods. Because the computational
overhead is significant compared to the advance of an Euler solution from time
$t_n$ to $t_{n+1}$, a runtime analysis does not reveal parallel speedup for any of these
examples.  Whereas  the  number  of  function  evaluations  is  an  effective  parameter  for 
comparing  algorithms,  we  need  a  different  metric  to  compare  a  parallel  algorithm 
to  a  sequential  algorithm.  Where  appropriate,  we  tabulate  the  number  of  sets  of 
concurrent  function  evaluations  as  a  proxy  for  measuring  parallel  speedup  when 
the  function  evaluation  costs  dominate.  A  set  of  concurrent  function  evaluations 
consists  of  function  evaluations  that  can  be  evaluated  in  parallel. 


4.1.  RIDC  with  nonuniform  step-sizes.  For  our  first  numerical  experiment,  we 
demonstrate  that  RIDC  integrators  with  nonuniform  step-sizes  converge  and  achieve 
their  design  orders  of  accuracy.  Figure  3  shows  the  classical  convergence  study 
(error  as  a  function  of  mean  step-size)  for  the  RIDC  integrator  applied  to  (AUZ). 
Figure  3(a)  shows  the  convergence  of  RIDC  integrators  with  uniform  step-sizes; 
Figure 3(b)-(d) shows the convergence of RIDC integrators when random step-sizes
are chosen. The random step-sizes are chosen so that

\[ \Delta t_n \in \Bigl[ \frac{1}{\omega} \Delta t_{n-1},\; \omega \, \Delta t_{n-1} \Bigr], \qquad \omega \ge 1, \]

where $\omega$ controls how rapidly a step-size is allowed to change. The figures show
that  RIDC  integrators  with  nonuniform  step-sizes  achieve  their  designed  order  of 
accuracy  (each  additional  correction  improves  the  order  of  accuracy  by  one),  at 
least up to order 6. In Figure 3(a) (corresponding to RIDC with uniform step-sizes),
we observe that the error stagnates at a value significantly larger than machine
precision. This is likely due to numerical issues associated with quadrature on
equispaced nodes [14]. We note that $\omega = 1$ recovers uniform step-sizes.
We  also  observe  that  as  the  ratio  of  the  largest  to  the  smallest  cell  increases,  the 
performance  of  higher-order  RIDC  methods  degrades,  likely  due  to  round-off  error 
associated  with  calculating  the  quadrature  and  interpolation  weights. 
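A sequence of random step-sizes of this kind could be generated as in the following sketch. It assumes each step is drawn uniformly from the interval $[\Delta t_{n-1}/\omega,\ \omega\,\Delta t_{n-1}]$; the sampling distribution, the function name `random_steps`, and the seed handling are assumptions made for illustration, since the text only states the interval.

```python
import random


def random_steps(dt0, omega, n, seed=0):
    """Generate n step-sizes, each within a factor omega of its predecessor."""
    rng = random.Random(seed)  # seeded for reproducibility
    steps = [dt0]
    for _ in range(n - 1):
        prev = steps[-1]
        # Assumed distribution: uniform on [prev / omega, omega * prev].
        steps.append(rng.uniform(prev / omega, omega * prev))
    return steps


steps = random_steps(0.1, 2.0, 50)
uniform = random_steps(0.1, 1.0, 5)  # omega = 1 gives constant steps
print(max(steps) / min(steps), uniform)
```

Printing the ratio of the largest to the smallest step gives a quick indication of how nonuniform the resulting grid is, which is the quantity the text associates with degraded performance of the higher-order methods.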

Figure 4 shows the convergence study (error as a function of mean step-size)
for (LORENZ). The reference solution is computed using an RK-45 integrator
with  a  fine  time  step.  Similar  observations  can  be  made  that  RIDC  methods  with 
nonuniform  step-sizes  converge  with  their  designed  orders  of  accuracy  (at  least  up 
to  order  6). 


4.2. Adaptive RIDC. We study four different variants of RIDC methods with
adaptive step-size control: (i) step doubling is used for adaptive step-size control on
the prediction level only (Section 4.2.1); (ii) an embedded RK pair is used for
adaptive step-size control on the prediction level only (Section 4.2.2); (iii) step
doubling is used for adaptive step-size control on the prediction and correction
levels (Section 4.2.3); and (iv) step doubling is used for adaptive step-size control
on the prediction level, and the computed errors from the error equation (3) are
used for adaptive step-size control on the correction levels.

Figure 3. Auzinger IVP: the design order is illustrated for the RIDC methods.

4.2.1.  Step  doubling  on  the  prediction  level  only.  In  this  numerical  experiment,  we 
solve  the  orbit  problem  (ORBIT)  using  a  fourth-order  RIDC  method  (constructed 
using  forward  Euler  integrators),  and  adaptive  step-size  control  on  the  prediction 
level only, where step doubling is used to provide the error estimate. As shown in
Figure 5, successive correction loops are able to reduce the error in the solution
and recover the desired orbit. The red circles in Figure 5(a) indicate rejected
steps. Figure 6(a) shows that RIDC with step doubling only on the prediction level
converges as the tolerance is reduced. In this experiment, the RIDC integrator is
reset after every 100 accepted steps. By “reset” [10], we mean that the highest-order
solution after every 100 steps is used as an initial condition to reinitialize the
provisional solution; e.g., instead of solving (1), one solves a sequence of problems

\[ \begin{aligned} y'(t) &= f(t, y), \quad t \in [t_{100(i-1)}, \min(b, t_{100 i})], \\ y(t_{100(i-1)}) &= \eta^{[L-1]}_{100(i-1)}, \end{aligned} \]

if $(L - 1)$ correctors are applied and $\eta^{[L-1]}_0 = y_a$. The time steps chosen by the RIDC
integrator with resets performed every 100 and 400 steps are shown in Figure 6(b)
and (c).

Figure 4. Lorenz IVP: the design order is illustrated for the RIDC methods.

Panels: (a) Prediction. (b) First correction. (c) Second correction. (d) Third correction.

Figure 5. Orbit problem: although the prediction level gives a highly inaccurate solution,
successive correction loops are able to reduce the error and produce the desired orbit. The
red circles on the prediction level (a) indicate rejected steps.

Panels: (a) Convergence study. (b) Adaptive step-sizes selected (reset every 100 steps). (c) Adaptive step-sizes selected (reset every 400 steps).

Figure 6. Orbit problem: (a) convergence of a fourth-order RIDC method constructed
with forward Euler integrators and adaptive step-size control on the prediction level (using
step doubling). Convergence is measured relative to the exact solution as the tolerance is
decreased. A reset is performed after every 100 accepted steps for this convergence study.
In (b), the step-sizes selected for rtol = $10^{-3.5}$ and atol = $10^{-6.5}$ are displayed as the
solid curve and rejected steps as ×s; a reset is performed after every 100 steps. In (c), the
reset is performed after every 400 steps. Observe that although the number of rejected
steps increases, the overall $\Delta t$ chosen remains qualitatively similar.

In Figure 6(b), $\Delta t_{min} = 1.06 \times 10^{-4}$. If a nonadaptive fourth-order RIDC method
were used with $\Delta t_{min}$, 160814 uniform time steps would have been required. By
adaptively selecting the time steps for this example and tolerance, the adaptive RIDC
method required approximately one one-hundredth of the function evaluations,
corresponding to a one hundred-fold speedup. The effective parallel speedup can be
computed by taking the ratio of the total number of function evaluations required and
the number of sets of concurrent function evaluations required. For the computation
in Figure 6(b) where a reset is performed after every 100 steps, the parallel speedup
(if four processors are available) can be computed using

\[ \frac{(1456 \times 5) + 99}{(1456 \times 2) + (14 \times 6) + 99} \approx 2.38. \]

The numerator consists of the total number of function evaluations arising from
the number of steps taken and the computation of the error estimate using step
doubling and the number of function evaluations arising from the rejected steps. The
denominator consists of the number of concurrent function evaluations (including
startup costs for the RIDC method). Note that three of the processors sit idle while
the step doubling computation is being processed. The parallel speedup can be
improved if more levels are chosen, or if the number of resets is reduced. If a
reset is performed after every 400 steps (Figure 6(c)), the parallel speedup is

\[ \frac{(1591 \times 5) + 88}{(1591 \times 2) + (4 \times 6) + 88} \approx 2.44. \]
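The speedup ratios quoted in this section can be reproduced with a few lines of Python. The sketch below is only an arithmetic check. Its reading of the three terms (accepted steps, number of resets with six start-up evaluations each, and evaluations spent on rejected steps) is inferred from the surrounding description and the reset counts, and the function and parameter names are hypothetical.

```python
def effective_speedup(n_steps, n_resets, n_rejected_evals,
                      evals_per_step=5, concurrent_per_step=2,
                      startup_per_reset=6):
    """Ratio of total function evaluations to sets of concurrent evaluations.

    Defaults correspond to a four-level RIDC method with step doubling on
    the predictor, as read from the ratios in the text (an interpretation).
    """
    total = n_steps * evals_per_step + n_rejected_evals
    concurrent = (n_steps * concurrent_per_step
                  + n_resets * startup_per_reset
                  + n_rejected_evals)
    return total / concurrent


s100 = effective_speedup(1456, 14, 99)  # step doubling, reset every 100 steps
s400 = effective_speedup(1591, 4, 88)   # step doubling, reset every 400 steps
print(round(s100, 2), round(s400, 2))
```

Written this way, it is easy to see why fewer resets help: each reset adds start-up evaluations to the denominator only.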


4.2.2.  Embedded  RK  on  the  prediction  level  only.  In  this  numerical  experiment,  we 
repeat  the  orbit  problem  (ORBIT)  using  a  fourth-order  RIDC  method  constructed 
again  using  forward  Euler  integrators,  but  the  step-size  adaptivity  on  the  prediction 
level  uses  a  Heun-Euler  embedded  RK  pair.  This  simple  scheme  combines  Heun’s 
method,  which  is  second  order,  with  the  forward  Euler  method,  which  is  first  order. 
Figure  7(a)  shows  the  convergence  of  this  adaptive  RIDC  method  as  the  tolerance 
is reduced. As in the previous example, the RIDC integrator is reset after every 100
accepted steps for the convergence study. In Figure 7(b) and (c), we show the time
steps chosen by the RIDC integrator with resets performed after 100 or 400 steps,
respectively.

For the computation in Figure 7(b) where a reset is performed after every 100
steps, the parallel speedup (if four processors are available) is

\[ \frac{(2441 \times 5) + 60}{(2441 \times 2) + (24 \times 6) + 60} \approx 2.41. \]

If a reset is performed after every 400 steps (Figure 7(c)), the parallel speedup is

\[ \frac{(2276 \times 5) + 80}{(2276 \times 2) + (5 \times 6) + 80} \approx 2.46. \]


Not  surprisingly,  the  time  steps  chosen  by  the  RIDC  method  are  dependent  on 
the  specified  tolerances  and  the  error  estimator  (and  consequently  the  integrators 
used  to  obtain  a  provisional  solution  to  (1))  used  for  the  control  strategy.  One 
can  easily  construct  a  RIDC  integrator  using  higher-order  embedded  RK  pairs  to 
solve  for  a  provisional  solution  to  (1),  and  then  use  the  forward  Euler  method  to 
solve the error equation (3) on subsequent levels. For example, Figure 8 shows the
step-sizes chosen when the Bogacki-Shampine method [2] (a 3(2) embedded RK
pair) and the popular Runge-Kutta-Fehlberg 4(5) pair [13] are used to compute
the provisional solution (and error estimate) for the RIDC integrator. The same
tolerance of rtol = $10^{-3.5}$ is used to generate both graphs. As the order and
accuracy of the predictor increase, one can take larger time steps. For this example,
using higher-order embedded RK pairs as step-size control mechanisms for RIDC
methods results in less variation in time steps.

Panels: (a) Convergence study. (b) Adaptive step-sizes selected (reset every 100 steps). (c) Adaptive step-sizes selected (reset every 400 steps).

Figure 7. Orbit problem: (a) convergence of a fourth-order RIDC method constructed
with forward Euler integrators and adaptive step-size control on the prediction level (using
an embedded RK pair to estimate the error). Convergence is measured relative to the
exact solution as the tolerance is decreased. A reset is performed after every 100 accepted
steps for this convergence study. In (b), the step-sizes selected for rtol = $10^{-3.5}$ and
atol = $10^{-6.5}$ are displayed as the solid curve and rejected steps as ×s; a reset is
performed after every 100 steps. In (c), the reset is performed after every 400 steps.

Figure 8. Step-sizes selected by RIDC methods constructed using a Bogacki-Shampine
method, a 3(2) embedded pair (red) and the Runge-Kutta-Fehlberg 4(5) pair. Rejected
steps are indicated with ×s.


4.2.3.  Step  doubling  on  all  levels.  As  mentioned  in  Section  3.2,  it  might  be  ad¬ 
vantageous  to  use  adaptive  step-size  control  when  solving  the  error  equations. 
This  affords  a  myriad  of  parameters  that  can  be  used  to  tune  the  step-size  control 
mechanism.  In  this  set  of  numerical  experiments,  we  explore  how  the  choice  of 
tolerances  for  the  prediction/correction  levels  affect  the  step-size  selection. 

We  first  solve  the  Auzinger  IVP  using  step  doubling  on  all  the  levels,  i.e.,  both 
predictor  and  corrector  levels.  In  Figure  9,  we  show  the  computed  step-sizes  when 
we  naively  choose  the  same  tolerances  on  each  level.  As  expected,  the  predictor 
has  to  take  many  steps  (to  satisfy  the  stringent  user-supplied  tolerance),  whereas 
life  is  easy  for  the  correctors.  The  effective  parallel  speedup  is 

(5479  +  196  +  19  +  24)  x  2  +  15 

(5481  x  2)  +  15  ““  L  4' 
Figure 9. Auzinger IVP: step-size control is implemented on all prediction and correction levels. The same tolerances are used for each level. As expected, the predictor has a hard time (forward Euler must satisfy a stringent tolerance); on the other hand, life is easy for the correctors. Rejected steps are indicated with ×s. For this set of tolerances, 5481 sets of concurrent function evaluations are needed.
In principle, the correctors are not even needed. Equally important to note is that the error increases after the last correction loop. This might seem surprising at first glance, but ultimately it may not be unreasonable, because the steps selected to solve the third correction are not based on the solution to the error equation but rather on the original IVP.

Instead of naively choosing the same tolerances on each level, we now change the tolerance at each level, as described in Figure 10. By making this simple change, the number of accepted steps on each level is now on the same order of magnitude. Not surprisingly, the predictor still selects good steps. Interestingly, in Figure 10(a), the first correction is “noisy”, especially initially. For this set of tolerances, the effective parallel speedup is

\[ \frac{(58 + 7 + 30 + 61) \times 2 + (52 + 7 + 24)}{(135 \times 2) + (52 + 7 + 24)} \approx 1.12. \]
Figure 10. Auzinger IVP: different tolerances at each level. (a) Set 1 of tolerances. (b) Set 2 of tolerances. With the first set of tolerances, the step-size controller for the predictor is well behaved, as it is for the second and third correctors. The step-size controller for the first corrector, however, is noisy. 135 sets of concurrent function evaluations are needed to generate (a). With the second set of tolerances, the step-size controller for all correctors is reasonably well behaved. Here, 64 sets of concurrent function evaluations are needed.
By picking a different set of tolerances, we can eliminate the noise, as shown in Figure 10(b). For this set of tolerances, the parallel speedup is

\[ \frac{(58 + 24 + 20 + 39) \times 2 + (12 + 6 + 11)}{(64 \times 2) + (12 + 6 + 11)} \approx 1.98. \]
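The three speedup ratios in this subsection share one arithmetic pattern, which can be sketched as follows. The function and argument names are hypothetical, and the reading of the terms (step counts per level, a common additive term, and the number of sets of concurrent evaluations) is an assumption inferred from the displayed fractions; the sketch reproduces only the arithmetic.

```python
def effective_speedup(level_counts, extra_evals, concurrent_sets):
    """Ratio of serial to parallel cost, mirroring the displayed fractions.

    level_counts    -- step counts on the predictor and corrector levels
    extra_evals     -- additive term that appears in both costs
    concurrent_sets -- number of sets of concurrent function evaluations
    """
    serial_cost = 2 * sum(level_counts) + extra_evals
    parallel_cost = 2 * concurrent_sets + extra_evals
    return serial_cost / parallel_cost

same_tol = effective_speedup([5479, 196, 19, 24], 15, 5481)
set_1 = effective_speedup([58, 7, 30, 61], 52 + 7 + 24, 135)
set_2 = effective_speedup([58, 24, 20, 39], 12 + 6 + 11, 64)
print(f"{same_tol:.2f} {set_1:.2f} {set_2:.2f}")   # prints "1.04 1.12 1.98"
```

The ratio approaches the number of levels only when the step counts on all levels are balanced, which is what the second set of tolerances achieves.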

4.2.4.  Using  solutions  from  the  error  equation.  As  mentioned  in  Section  3.3,  using 
the  solution  from  the  error  equation  (3)  as  the  local  error  estimate  for  step-size 
control  on  a  given  level  is  potentially  problematic  because  the  step-size  controller 
can  only  control  the  local  error  introduced  on  that  level  whereas  the  true  local  error 
generally  contains  contributions  from  all  previous  levels.  For  completeness,  we 
present  the  results  of  this  adaptive  RIDC  formulation  applied  to  the  Orbit  problem 
(Figure 12) and the Auzinger problem (Figure 11). Step doubling is used for step-size adaptivity on the predictor level; solutions from the error equation are used to control step-sizes for the corrector levels. For the Auzinger problem, we observe in the top figure that if the tolerances are held fixed on each level, each correction level improves the solution. If the tolerance is reduced slightly on each level, the step-size controller gives a poor step-size selection (many rejected steps), even for this smoothly varying problem. For the Orbit IVP, Figure 12 shows that the corrector improves the solution if the tolerances are held fixed at all levels; however, the corrector requires many steps. A second correction loop was not attempted.
Reducing  the  tolerance  for  the  first  corrector  resulted  in  inordinately  many  rejected 
steps. 


5.  Conclusions 

In  this  paper,  we  formulated  RIDC  methods  that  incorporate  local  error  estimation 
and  adaptive  step-size  control.  Several  formulations  were  discussed  in  detail:  (i) 
step  doubling  on  the  prediction  level,  (ii)  embedded  RK  pairs  on  the  prediction  level, 
(iii)  step  doubling  on  the  prediction  and  error  levels,  and  (iv)  step  doubling  for  the 
prediction  level  but  using  the  solution  from  the  error  equation  for  step-size  control; 
other  formulations  are  also  alluded  to.  A  convergence  theorem  from  [17]  can  be 
extended  to  RIDC  methods  that  use  adaptive  step-size  control  on  the  prediction  level. 
Numerical  experiments  demonstrate  that  RIDC  methods  with  nonuniform  steps 
converge  as  designed  and  illustrate  the  type  of  behavior  that  might  be  observed 
when  adaptive  step-size  control  is  used  on  the  prediction  and  correction  levels. 
Based  on  our  numerical  study,  we  conclude  that  adaptive  step-size  control  on  the 
prediction  level  is  viable  for  RIDC  methods.  In  a  practical  application  where  a 
user  gives  a  specified  tolerance,  this  prescribed  tolerance  must  be  transformed  to  a 
specific  tolerance  that  is  fed  to  the  predictor. 
Figure 11. Auzinger problem: step doubling on prediction level, using successive levels for error estimation for step control on the error equation. Step-size controller for the corrector is noisy.
ABSTRACT


The aim of this paper is to build up the theoretical framework for the recovery of sparse signals from the magnitude of the measurements. We first investigate the minimal number of measurements for the success of the recovery of sparse signals from the magnitude of samples. We completely settle the minimality question for the real case and give a bound for the complex case. We then study the recovery performance of the $\ell_1$ minimization for the sparse phase retrieval problem. In particular, we present the null space property which, to our knowledge, is the first sufficient and necessary condition for the success of $\ell_1$ minimization for $k$-sparse phase retrieval.
1. Introduction


The theory of compressive sensing has generated enormous interest in recent years. The goal of compressive sensing is to recover a sparse signal from its linear measurements, where the number of measurements is much smaller than the dimension of the signal; see e.g. [4-6,12]. The aim of this paper is to study the problem of compressive sensing without the phase information. In this problem the goal is to recover a sparse signal from the magnitude of its linear samples.

Recovering a signal from the magnitude of its linear samples, commonly known as phase retrieval or phaseless reconstruction, has gained considerable attention in recent years [1,2,7,8]. It has important applications in X-ray imaging, crystallography, electron microscopy, coherence theory and other areas. In many applications the signals to be reconstructed are sparse. Thus it is natural to extend compressive sensing to the phase retrieval problem.
We first introduce the notation and briefly describe the mathematical background of the problem. Let $\mathcal{F} := \{f_1, f_2, \ldots, f_m\}$ be a set of vectors in $\mathbb{H}^d$ where $\mathbb{H}$ is either $\mathbb{R}$ or $\mathbb{C}$. Assume that $x \in \mathbb{H}^d$ such that $b_j = |\langle x, f_j \rangle|$. The phase retrieval problem asks whether we can reconstruct $x$ from $\{b_j\}_{j=1}^m$. Obviously, if $y = cx$ where $|c| = 1$ then $|\langle y, f_j \rangle| = |\langle x, f_j \rangle|$. Thus the best phase retrieval can do is to reconstruct $x$ up to a unimodular constant.

Consider the equivalence relation $\sim$ on $\mathbb{H}^d$: $x \sim y$ if and only if there is a constant $c \in \mathbb{H}$ with $|c| = 1$ such that $x = cy$. Let $\tilde{\mathbb{H}}^d := \mathbb{H}^d/\sim$. We shall use $\tilde{x}$ to denote the equivalence class containing $x$. For a given $\mathcal{F}$ in $\mathbb{H}^d$ define the map $\mathbf{M}_{\mathcal{F}} : \tilde{\mathbb{H}}^d \to \mathbb{R}^m$ by

\[ \mathbf{M}_{\mathcal{F}}(\tilde{x}) = \bigl[\, |\langle x, f_1 \rangle|^2, \ldots, |\langle x, f_m \rangle|^2 \,\bigr]^T. \tag{1.1} \]

The phase retrieval problem asks whether an $\tilde{x} \in \tilde{\mathbb{H}}^d$ is uniquely determined by $\mathbf{M}_{\mathcal{F}}(\tilde{x})$, i.e. $\tilde{x}$ is recoverable from $\mathbf{M}_{\mathcal{F}}(\tilde{x})$. We say that a set of vectors $\mathcal{F}$ has the phase retrieval property, or is phase retrievable, if $\mathbf{M}_{\mathcal{F}}$ is injective on $\tilde{\mathbb{H}}^d = \mathbb{H}^d/\sim$.
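The invariance of the magnitude measurements under a unimodular constant can be checked numerically. The sketch below assumes the convention $\langle x, f \rangle = \sum_i x_i \overline{f_i}$; the frame, the signal, and the function names are hypothetical and chosen only for illustration.

```python
import cmath

def inner(x, f):
    """Inner product <x, f> = sum_i x_i * conj(f_i) (assumed convention)."""
    return sum(xi * fi.conjugate() for xi, fi in zip(x, f))

def measure(frame, x):
    """Squared-magnitude measurements [|<x, f_1>|^2, ..., |<x, f_m>|^2]."""
    return [abs(inner(x, f)) ** 2 for f in frame]

# Hypothetical frame of m = 4 vectors in C^2 and a test signal.
frame = [[1 + 0j, 0j], [0j, 1 + 0j], [1 + 0j, 1 + 0j], [1 + 0j, 1j]]
x = [1 + 2j, 3 - 1j]
c = cmath.exp(0.7j)                      # unimodular constant, |c| = 1
cx = [c * xi for xi in x]

m_x, m_cx = measure(frame, x), measure(frame, cx)
same = all(abs(a - b) < 1e-12 for a, b in zip(m_x, m_cx))
print(same)                              # x and c*x give the same measurements
```

This is why injectivity can only be asked for on equivalence classes modulo a unimodular constant, not on the signals themselves.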

It is known that in the real case $\mathbb{H} = \mathbb{R}$ the set $\mathcal{F}$ needs to have at least $m \ge 2d - 1$ vectors to have the phase retrieval property; furthermore, a generic set of $m \ge 2d - 1$ elements in $\mathbb{R}^d$ will have the phase retrieval property (cf. Balan, Casazza and Edidin [1]). In the complex case $\mathbb{H} = \mathbb{C}$ the same question remains open, and is perhaps the most prominent open problem in phase retrieval. It is known that $m \ge 4d - 2$ generic vectors in $\mathbb{C}^d$ have the phase retrieval property [1]. The result is improved to $m \ge 4d - 4$ in [10]. A set of $m = 4d - 4$ vectors having the phase retrieval property is also constructed in [3]. The current conjecture is that the phase retrieval property in $\mathbb{C}^d$ can only hold when $m \ge 4d - 4$.

The aforementioned results concern the general phase retrieval problem in $\mathbb{H}^d$. In many applications, however, the signal $x$ is often sparse with $\|x\|_0 = k \ll d$.

We use the standard notation $\mathbb{H}^d_k$ to denote the subset of $\mathbb{H}^d$ whose elements $x$ have $\|x\|_0 \le k$. Let $\tilde{\mathbb{H}}^d_k$ denote $\mathbb{H}^d_k/\sim$. A set $\mathcal{F}$ of vectors in $\mathbb{H}^d$ is said to have the $k$-sparse phase retrieval property, or is $k$-sparse phase retrievable, if any $\tilde{x} \in \tilde{\mathbb{H}}^d_k$ is uniquely determined by $\mathbf{M}_{\mathcal{F}}(\tilde{x})$. In other words, the map $\mathbf{M}_{\mathcal{F}}$ is injective on $\tilde{\mathbb{H}}^d_k$. One may naturally ask: How many vectors does $\mathcal{F}$ need to have so that $\mathcal{F}$ is $k$-sparse phase retrievable?

The best current results on the $k$-sparse phase retrieval property are proved by Li and Voroninski [16], which state that the $k$-sparse phase retrieval property can be achieved by having $m \ge 4k$ and $m \ge 8k$ vectors for the real and complex case, respectively (see also [18]).

In Section 2, we prove sharper results for a set of vectors $\mathcal{F}$ to have the $k$-sparse phase retrieval property. In the real case $\mathbb{H} = \mathbb{R}$ we obtain a sharp result. We show that for any $k < d$ the set $\mathcal{F}$ must have at least $m \ge 2k$ elements to be $k$-sparse phase retrievable. Furthermore, any $m \ge 2k$ generic vectors will be $k$-sparse phase retrievable. In the complex case $\mathbb{H} = \mathbb{C}$ we prove that any $m \ge 4k - 2$ generic vectors have the $k$-sparse phase retrieval property. We conjecture that this bound is also sharp, namely for $k < d$ a set $\mathcal{F}$ in $\mathbb{C}^d$ needs at least $4k - 2$ vectors to have the $k$-sparse phase retrieval property.
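To make the comparison with the earlier $4k$ and $8k$ counts concrete, the stated bounds can be evaluated for sample values. The function names and the values $k = 5$, $d = 100$ are hypothetical; the real count combines the $2k$ bound for $k < d$ with the known $2d - 1$ bound for $k = d$, and the complex count is only a sufficient number for generic vectors.

```python
def real_minimal_count(k, d):
    """Sharp count in the real case: 2k for k < d, and 2d - 1 for k = d."""
    return min(2 * k, 2 * d - 1)

def complex_sufficient_count(k):
    """Sufficient count 4k - 2 for generic vectors in the complex case."""
    return 4 * k - 2

k, d = 5, 100
print(real_minimal_count(k, d), 4 * k)       # prints "10 20"
print(complex_sufficient_count(k), 8 * k)    # prints "18 40"
```

In both cases the count depends on the sparsity $k$ rather than on the ambient dimension $d$.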

A foundation of compressive sensing is built on the fact that the recovery of a sparse signal from a system of under-determined linear equations is equivalent to finding the extremal value of $\ell_1$ minimization under certain conditions. The $\ell_1$ minimization is extended to phase retrieval in [17], and many algorithms have been developed to compute it (see [20,22]). However, there have been few theoretical results on the recovery performance of $\ell_1$ minimization for sparse phase retrieval. In Section 3, we present the null space property, which, to our knowledge, is the first sufficient and necessary condition for the success of $\ell_1$ minimization for $k$-sparse phase retrieval. If we take $k = d$, the null space property reduces to a condition on the set of vectors $\mathcal{F}$ under which $\mathbf{M}_{\mathcal{F}}$ is injective on $\mathbb{C}^d/\sim$, and we present it in Section 4.
2. Minimal sample number for $k$-sparse phase retrieval

In this section we study the problem of the minimal number of samples (measurements) required for $k$-sparse phase retrieval. We shall introduce more notation here. Often it is convenient to identify a set of vectors $\mathcal{F} = \{f_1, f_2, \ldots, f_m\}$ with the matrix $F = [f_1, f_2, \ldots, f_m]$ whose columns are the vectors $f_j$. When $\mathcal{F}$ is a frame this is known as the frame matrix of $\mathcal{F}$. We shall use the term frame matrix for $F$ regardless of whether $\mathcal{F}$ is a frame or not. Also for integers $n \le m$ we use the notation $[n : m]$ to denote the set $\{n, n+1, \ldots, m\}$. For $x \in \mathbb{H}^d$, we set $|x| := [|x_1|, \ldots, |x_d|]$. Similar to before, we let

\[ \mathbb{R}^d_k := \{x \in \mathbb{R}^d : \|x\|_0 \le k\}. \]

Our first theorem completely settles the minimality question for $k$-sparse phase retrieval in the real case $\mathbb{H} = \mathbb{R}$.


Theorem 2.1. Let $\mathcal{F} = \{f_1, \ldots, f_m\}$ be a set of vectors in $\mathbb{R}^d$. Assume that $\mathcal{F}$ is $k$-sparse phase retrievable on $\mathbb{R}^d$. Then $m \ge \min\{2k, 2d-1\}$. Furthermore, a set $\mathcal{F}$ of $m \ge \min\{2k, 2d-1\}$ generically chosen vectors in $\mathbb{R}^d$ is $k$-sparse phase retrievable.

Proof. Note that the full sparsity case $k = d$ is already known: $m \ge 2d - 1$ vectors are needed for phase retrieval and a generic set $\mathcal{F}$ with $m \ge 2d - 1$ vectors will have the phase retrieval property. So we will focus only on $k < d$.

We first prove that $m \ge 2k$. Assume $\mathcal{F}$ has $m < 2k$ elements. We prove $\mathcal{F}$ does not have the $k$-sparse phase retrieval property by constructing $x, y \in \mathbb{R}^d_k$ with $|\langle x, f_j \rangle| = |\langle y, f_j \rangle|$ but $x \ne \pm y$.

We divide $\mathcal{F}$ into two groups: $\mathcal{F}_1 = \{f_j : j \in [1 : k]\}$ and $\mathcal{F}_2 = \{f_j : j \in [k+1 : m]\}$. Let the corresponding frame matrices be $F_1$ and $F_2$, respectively. Consider the subspace

\[ W = \{[x_1, x_2, \ldots, x_{k+1}, 0, \ldots, 0]^T \in \mathbb{R}^d : x_1, \ldots, x_{k+1} \in \mathbb{R}\}. \]

For the first group $\mathcal{F}_1$, there exists a $u \in W \setminus \{0\}$ such that $F_1^T u = 0$, i.e. $\langle f_j, u \rangle = 0$ for all $1 \le j \le k$. This is because $\dim(W) = k + 1$ and there are only $k$ equations. Note also that there are at most $k - 1$ vectors in the second group $\mathcal{F}_2$ since $m - k < 2k - k = k$. Thus the solution space

\[ \{v \in W : F_2^T v = 0\} \]

has dimension at least 2. Hence, there exist linearly independent $\alpha, \beta \in W$ so that for all $t, s \in \mathbb{R}$

\[ v = t\alpha + s\beta \]

satisfies

\[ F_2^T v = 0, \quad\text{i.e.}\quad \langle f_j, v \rangle = 0 \text{ for } j \in [k+1 : m]. \]

Write $u = [u_1, u_2, \ldots, u_d]^T$ (where $u_j = 0$ for $j > k + 1$). Since $\alpha$ and $\beta$ are linearly independent, we may without loss of generality assume $[\alpha_1, \alpha_2]^T$ and $[\beta_1, \beta_2]^T$ are linearly independent, where $\alpha = [\alpha_1, \ldots, \alpha_d]^T$ and $\beta = [\beta_1, \ldots, \beta_d]^T$. We first consider the case where either $u_1 \ne 0$ or $u_2 \ne 0$. Then there exist $s_0, t_0 \in \mathbb{R}$ with $(s_0, t_0) \ne (0, 0)$ so that

\begin{align*}
u_1 &= t_0 \alpha_1 + s_0 \beta_1, \\
-u_2 &= t_0 \alpha_2 + s_0 \beta_2.
\end{align*}

Now set $v = t_0 \alpha + s_0 \beta$ and

\[ x := u + v, \quad y := u - v. \]

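Clearly $x, y \in \mathbb{R}^d_k$ since $\operatorname{supp}(x) \subset \{1, 3, \ldots, k+1\}$ and $\operatorname{supp}(y) \subset \{2, 3, \ldots, k+1\}$. Moreover

\[ \langle f_j, x \rangle = \langle f_j, u \rangle + \langle f_j, v \rangle = \begin{cases} \langle f_j, v \rangle & j \le k, \\ \langle f_j, u \rangle & j > k, \end{cases} \]

and similarly

\[ \langle f_j, y \rangle = \langle f_j, u \rangle - \langle f_j, v \rangle = \begin{cases} -\langle f_j, v \rangle & j \le k, \\ \langle f_j, u \rangle & j > k. \end{cases} \]

It follows that $|\langle f_j, x \rangle| = |\langle f_j, y \rangle|$ for all $j$ but $x \ne \pm y$. We next consider the case where $u_1 = u_2 = 0$. Then there exist $s_0, t_0 \in \mathbb{R}$ with $(s_0, t_0) \ne (0, 0)$ so that

\begin{align*}
0 &= t_0 \alpha_1 + s_0 \beta_1, \\
1 &= t_0 \alpha_2 + s_0 \beta_2.
\end{align*}

Similar to before, we set $v = t_0 \alpha + s_0 \beta$ and

\[ x := u + v, \quad y := u - v. \]

Then $x, y \in \mathbb{R}^d_k$ and $|\langle f_j, x \rangle| = |\langle f_j, y \rangle|$ for all $j$ but $x \ne \pm y$. Thus $\mathcal{F}$ does not have the $k$-sparse phase retrieval property in $\mathbb{R}^d_k$.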
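The construction can be traced on a small instance with exact rational arithmetic. The sketch below takes $d = 3$, $k = 2$ and $m = 3 < 2k$, so that $W$ is all of $\mathbb{R}^3$; the three vectors are hypothetical, and the basis used for the solution space assumes that the last entry of the third vector is nonzero. It covers only the case where $u_1$ or $u_2$ is nonzero.

```python
from fractions import Fraction

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

# Hypothetical measurement vectors in R^3 (d = 3, k = 2, m = 3 < 2k).
f1, f2, f3 = ([Fraction(v) for v in f] for f in ([1, 2, 3], [0, 1, 1], [1, 1, 2]))

u = cross(f1, f2)                        # <f1, u> = <f2, u> = 0
# Basis of {v : <f3, v> = 0}; valid because the last entry of f3 is nonzero.
alpha = [Fraction(1), Fraction(0), -f3[0] / f3[2]]
beta = [Fraction(0), Fraction(1), -f3[1] / f3[2]]
t0, s0 = u[0], -u[1]                     # u1 = t0*a1 + s0*b1, -u2 = t0*a2 + s0*b2
v = [t0 * a + s0 * b for a, b in zip(alpha, beta)]

x = [ui + vi for ui, vi in zip(u, v)]    # second entry vanishes
y = [ui - vi for ui, vi in zip(u, v)]    # first entry vanishes
mags_x = [abs(dot(f, x)) for f in (f1, f2, f3)]
mags_y = [abs(dot(f, y)) for f in (f1, f2, f3)]
```

Here $x = (-2, 0, 1)$ and $y = (0, -2, 1)$ are both 2-sparse, share the magnitudes $(1, 1, 0)$, and satisfy $x \ne \pm y$.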

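We next prove that a set $\mathcal{F}$ of $m \ge 2k$ generic vectors will have the $k$-sparse phase retrieval property. Let us first fix $I, J \subset [1 : d]$ with $\#I = \#J = k$. The goal is to prove that if $x, y \in \mathbb{R}^d_k$ with $\operatorname{supp}(x) \subset I$ and $\operatorname{supp}(y) \subset J$ satisfy

\[ |\langle f_j, x \rangle|^2 = |\langle f_j, y \rangle|^2, \quad j = 1, \ldots, m, \tag{2.1} \]

then $x = \pm y$. Eq. (2.1) implies that for all $j$ we have

\[ \langle f_j, x - y \rangle \cdot \langle f_j, x + y \rangle = 0. \tag{2.2} \]

Thus either $\langle f_j, x - y \rangle = 0$ or $\langle f_j, x + y \rangle = 0$. Without loss of generality, we assume that

\begin{align*}
\langle f_j, x - y \rangle &= 0, \quad j \in [1 : n], \\
\langle f_j, x + y \rangle &= 0, \quad j \in [n+1 : m]. \tag{2.3}
\end{align*}

Set

\[ L := I \cap J \quad\text{and}\quad \ell := \#L. \]

For convenience we write

\begin{align*}
x &= u_x + v_x, \quad \operatorname{supp}(u_x) \subset L, \quad \operatorname{supp}(v_x) \subset I \setminus L, \\
y &= u_y + v_y, \quad \operatorname{supp}(u_y) \subset L, \quad \operatorname{supp}(v_y) \subset J \setminus L.
\end{align*}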
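We abuse the notation a little by viewing $v_x \in \mathbb{R}^{k-\ell}$ since it is supported on $I \setminus L$ with $\#(I \setminus L) = k - \ell$. Similarly we view $v_y \in \mathbb{R}^{k-\ell}$ and $u_x, u_y \in \mathbb{R}^{\ell}$. Set

\[ w_- := u_x - u_y, \quad w_+ := u_x + u_y, \quad\text{and}\quad z := \begin{bmatrix} v_x \\ v_y \\ w_- \\ w_+ \end{bmatrix} \in \mathbb{R}^{2k}. \]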


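Using the notation above, we have

\begin{align*}
\langle f_j, x - y \rangle &= \langle f_j, v_x \rangle - \langle f_j, v_y \rangle + \langle f_j, w_- \rangle, \\
\langle f_j, x + y \rangle &= \langle f_j, v_x \rangle + \langle f_j, v_y \rangle + \langle f_j, w_+ \rangle. \tag{2.4}
\end{align*}

Set $A := F^T$ where $F$ is the frame matrix of $\mathcal{F}$. Combining (2.3) and (2.4) now yields

\[ \begin{bmatrix} A_{[1:n], I \setminus L} & -A_{[1:n], J \setminus L} & A_{[1:n], L} & 0 \\ A_{[n+1:m], I \setminus L} & A_{[n+1:m], J \setminus L} & 0 & A_{[n+1:m], L} \end{bmatrix} \begin{bmatrix} v_x \\ v_y \\ w_- \\ w_+ \end{bmatrix} = 0, \tag{2.5} \]

where for any index sets $J_1, J_2$ we use the notation $A_{J_1, J_2}$ to denote the sub-matrix of $A$ with the rows indexed in $J_1$ and columns indexed in $J_2$. To show $x = \pm y$ we only need to show that the linear equations (2.5) force $v_x = 0$, $v_y = 0$ and either $w_- = 0$ or $w_+ = 0$.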

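We first consider the case $n \ge 2k - \ell$. In this case, we consider only the first set of Eqs. (2.5):

\[ \begin{bmatrix} A_{[1:n], I \setminus L} & -A_{[1:n], J \setminus L} & A_{[1:n], L} \end{bmatrix} \begin{bmatrix} v_x \\ v_y \\ w_- \end{bmatrix} = 0. \tag{2.6} \]

Note that the matrix

\[ \begin{bmatrix} A_{[1:n], I \setminus L} & -A_{[1:n], J \setminus L} & A_{[1:n], L} \end{bmatrix} \]

has dimensions $n \times (2k - \ell)$. The elements are generically chosen. Thus it has full rank $2k - \ell$. It follows that (2.6) has only the trivial solution $v_x = 0$, $v_y = 0$ and $w_- = 0$. Hence $x = y$.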

We next consider the case with $m - n \ge 2k - \ell$. Here we consider the second set of Eqs. (2.5):

\[ \begin{bmatrix} A_{[n+1:m], I \setminus L} & A_{[n+1:m], J \setminus L} & A_{[n+1:m], L} \end{bmatrix} \begin{bmatrix} v_x \\ v_y \\ w_+ \end{bmatrix} = 0. \tag{2.7} \]

The same argument used for the case $n \ge 2k - \ell$ now applies to yield $v_x = 0$, $v_y = 0$ and $w_+ = 0$. Hence in this case $x = -y$.

We finally consider the case where $n < 2k - \ell$ and $m - n < 2k - \ell$. In this case we must have

\[ 2k - \ell > m - n \ge 2k - n, \]

and hence $n > \ell$. Similarly, we have $\ell < 2k - n \le m - n$. We argue that the rank of the matrix in (2.5) is $2k$ when $F$ is generic. Let $B$ denote the matrix in (2.5). If $\operatorname{rank}(B) < 2k$ then all $2k \times 2k$ sub-matrices of $B$ have determinant 0. Note that each determinant is either identically 0 or a nontrivial polynomial of the entries of $F$. Hence if there exists a single example of a matrix $B$ with $\operatorname{rank}(B) = 2k$ then $\operatorname{rank}(B) = 2k$ for a generic choice of $F$. We shall construct an example of such an $F$ with $\operatorname{rank}(B) = 2k$. Set

\begin{align*}
A_{[1:n], L} &= \begin{bmatrix} I_\ell \\ 0 \end{bmatrix}, & \bigl[A_{[1:n], I \setminus L},\; -A_{[1:n], J \setminus L}\bigr] &= \begin{bmatrix} 0 \\ H_1 \end{bmatrix}, \\
A_{[n+1:m], L} &= \begin{bmatrix} I_\ell \\ 0 \end{bmatrix}, & \bigl[A_{[n+1:m], I \setminus L},\; A_{[n+1:m], J \setminus L}\bigr] &= \begin{bmatrix} 0 \\ H_2 \end{bmatrix},
\end{align*}

where $I_\ell$ denotes the $\ell \times \ell$ identity matrix. With this choice, for almost all $H_1 \in \mathbb{R}^{(n-\ell) \times (2k-2\ell)}$, $H_2 \in \mathbb{R}^{(m-n-\ell) \times (2k-2\ell)}$ we have $\operatorname{rank}(B) = 2k$. The solution to (2.5) is thus trivial, namely $v_x = 0$, $v_y = 0$, $w_- = 0$ and $w_+ = 0$. Thus $x = y = 0$. The theorem is now proved. □
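The rank claim for the example matrix can be checked exactly on a small instance. The sketch below assumes $k = 2$, $\ell = 1$, $n = 2$, $m = 4$, so that $H_1$ and $H_2$ are $1 \times 2$ blocks. The helper names, the column ordering $(v_x, v_y \mid w_- \mid w_+)$, and the entries of $H_1$, $H_2$ are all hypothetical.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix via exact Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in r] for r in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

def build_B(ell, H1, H2):
    """Assemble the example matrix with columns (v_x, v_y | w_- | w_+)."""
    width = len(H1[0])                   # 2k - 2*ell columns for (v_x, v_y)
    zeros = [0] * ell
    eye = [[int(i == j) for j in range(ell)] for i in range(ell)]
    top = [[0] * width + e + zeros for e in eye] + [list(h) + zeros + zeros for h in H1]
    bottom = [[0] * width + zeros + e for e in eye] + [list(h) + zeros + zeros for h in H2]
    return top + bottom

rank_generic = rank(build_B(1, [[1, 2]], [[3, 4]]))      # stacked H blocks invertible
rank_degenerate = rank(build_B(1, [[1, 2]], [[2, 4]]))   # stacked H blocks singular
print(rank_generic, rank_degenerate)                     # prints "4 3"
```

Full rank $2k = 4$ is attained whenever the stacked block $[H_1; H_2]$ has full column rank, and it fails only on the measure-zero set where that block is rank deficient.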


We next consider the complex case. Similar to the real case we set

\[ \mathbb{C}^d_k := \{x \in \mathbb{C}^d : \|x\|_0 \le k\}. \]

Then we have

Theorem 2.2. A set $\mathcal{F}$ of $m \ge 4k - 2$ generically chosen vectors in $\mathbb{C}^d$ is $k$-sparse phase retrievable.

Proof. We shall identify $\mathcal{F}$ with $F$ where $F = [f_1, f_2, \ldots, f_m]$ is the corresponding frame matrix, $F = [f_{ij}]$. Following the technique in [1] we shall view $F$ as an element in $\mathbb{R}^{2md}$. The goal here is to show that the set of matrices $F$ that are not $k$-sparse phase retrievable has local real dimension strictly smaller than $2md$ provided $m \ge 4k - 2$.

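For any subsets of indices $I, J \subset [1 : d]$ with $\#I = \#J = k$ let $G_{I,J}$ denote the set of matrices in $\mathbb{C}^{d \times m}$ with the following property: There exist $x, y \in \mathbb{C}^d$ where $\operatorname{supp}(x) \subset I$, $\operatorname{supp}(y) \subset J$ and $x \ne cy$ with $|c| = 1$ such that $\mathbf{M}_{\mathcal{F}}(x) = \mathbf{M}_{\mathcal{F}}(y)$, i.e. $|\langle f_j, x \rangle| = |\langle f_j, y \rangle|$ for all $j$. Now if $\mathbf{M}_{\mathcal{F}}(x) = \mathbf{M}_{\mathcal{F}}(y)$, then for any $a, \omega \in \mathbb{C}$ with $|\omega| = 1$ we also have $\mathbf{M}_{\mathcal{F}}(ax) = \mathbf{M}_{\mathcal{F}}(a\omega y)$. Thus for any $F \in G_{I,J}$ we may find $x, y \in \mathbb{C}^d$ with $\mathbf{M}_{\mathcal{F}}(x) = \mathbf{M}_{\mathcal{F}}(y)$ such that

• $\operatorname{supp}(x) \subset I$, $\operatorname{supp}(y) \subset J$.

• The first nonzero entry of $x$ is 1.

• The first nonzero entry of $y$ is real and positive.

Let $X$ denote the subset of $\mathbb{C}^d$ consisting of elements $x \in \mathbb{C}^d$ whose first nonzero entry is 1. Let $Y$ denote the subset of $\mathbb{C}^d$ consisting of elements $y \in \mathbb{C}^d$ whose first nonzero entry, if it exists, is real and positive. Note that in essence $X$ can be viewed as the projective space $\mathbb{P}^{d-1} \setminus \{0\}$ and $Y$ can be viewed as the set $\mathbb{C}^d/\sim$. Let $\mathbb{C}^d_I$ denote the set of vectors $x \in \mathbb{C}^d$ such that $\operatorname{supp}(x) \subset I$. Now consider the set of 3-tuples

\[ \mathcal{A}_{I,J} := \{(F, x, y)\} \]

with the following properties:

• $x \in X \cap \mathbb{C}^d_I$ and $y \in Y \cap \mathbb{C}^d_J$.

• $x \ne \omega y$ for any $\omega \in \mathbb{C}$ with $|\omega| = 1$.

• $\mathbf{M}_{\mathcal{F}}(x) = \mathbf{M}_{\mathcal{F}}(y)$.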
Now the projection of $\mathcal{A}_{I,J}$ to the first component gives the full set $G_{I,J}$. Each $(F, x, y) \in \mathcal{A}_{I,J}$ gives rise to the constraints $|\langle f_j, x \rangle| = |\langle f_j, y \rangle|$ for $j \in [1 : m]$, which lead to the set of quadratic equations in $\operatorname{Re}(f_{ij})$, $\operatorname{Im}(f_{ij})$ (by viewing $x, y$ as fixed)

\[ \Bigl| \sum_{i=1}^{d} f_{ij} x_i \Bigr| = \Bigl| \sum_{i=1}^{d} f_{ij} y_i \Bigr|, \quad j = 1, \ldots, m. \tag{2.8} \]

Note that all equations are independent and each is non-trivial because $x \ne y$ in $\mathbb{C}^d/\sim$. Thus for any fixed $x, y$ the set of such $F = [f_{ij}]$ satisfying (2.8) is a real algebraic variety of (real) codimension $m$. Hence, $\mathcal{A}_{I,J}$ has local dimension everywhere at most

\[ 2md - m + \dim_{\mathbb{R}}(X \cap \mathbb{C}^d_I) + \dim_{\mathbb{R}}(Y \cap \mathbb{C}^d_J) = 2md - m + 2k - 2 + 2k - 1 = 2md - (m - 4k + 3). \]

It follows from $m \ge 4k - 2$ that $\mathcal{A}_{I,J}$ has local (real) dimension at most $2md - 1$. Now $G_{I,J}$ is the projection of $\mathcal{A}_{I,J}$ onto the first component. Thus, $G_{I,J}$ has dimension at most $2md - 1$. In other words, a generic $F \in \mathbb{C}^{d \times m}$ is not in $G_{I,J}$.

Finally, the set of $F \in \mathbb{C}^{d \times m}$ not having the $k$-sparse phase retrieval property for $\mathbb{C}^d_k$ is the union of all $G_{I,J}$ with $\#I = \#J = k$. It is a finite union. The theorem is now proved. □
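The two real dimensions used in the count can be spelled out as follows. This is a sketch of the bookkeeping only, assuming that a vector supported on an index set of size $k$ has $k$ free complex entries, one of which is then normalized.

```latex
\begin{align*}
\dim_{\mathbb{R}}\bigl(X \cap \mathbb{C}^d_I\bigr)
  &\le 2(k-1) = 2k-2
  && \text{(first nonzero entry fixed to } 1\text{)}, \\
\dim_{\mathbb{R}}\bigl(Y \cap \mathbb{C}^d_J\bigr)
  &\le 2(k-1) + 1 = 2k-1
  && \text{(first nonzero entry real and positive)}, \\
2md - m + (2k-2) + (2k-1)
  &= 2md - (m - 4k + 3) \le 2md - 1
  && \text{whenever } m \ge 4k-2 .
\end{align*}
```

The last line is exactly where the threshold $4k - 2$ enters: it is the smallest $m$ for which $m - 4k + 3 \ge 1$.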

Remark. Although the above theorem shows that in the complex case any $m \ge 4k - 2$ generically chosen vectors are $k$-sparse phase retrievable, it is unknown whether $4k - 2$ is in fact the minimal number required. It would be interesting to use the techniques developed in [10] to improve the result.

3.  Null  space  property  for  sparse  phase  retrieval 

In this section, we investigate the performance of $\ell_1$ minimization for sparse phase retrieval by extending the null space property in compressed sensing to the phase retrieval setting. We first introduce the null space property in compressed sensing, and then extend it to the phase retrieval setting on $\mathbb{R}^d_k$ and $\mathbb{C}^d_k$, respectively.

3.1.  Null  space  property 

A key concept in compressive sensing is the so-called null space property of a matrix. For a given frame $\mathcal{F} = \{f_1, \ldots, f_m\} \subset \mathbb{H}^d$, we use $F$ to denote the frame matrix. Let $\mathcal{N}(F)$ denote the kernel of $F^T$, i.e.,

\[ \mathcal{N}(F) = \{\eta \in \mathbb{H}^d : \langle f_j, \eta \rangle = 0,\ j = 1, \ldots, m\}. \]

For convenience, when $\mathcal{F} = \emptyset$, we set $\mathcal{N}(F) := \mathbb{H}^d$.

Definition 3.1. The matrix $F$ satisfies the null space property of order $k$ if for any nonzero $\eta = [\eta_1, \ldots, \eta_d]^T \in \mathcal{N}(F)$ and any $T \subset [1 : d]$ with $\#T \le k$ it holds that

\[ \|\eta_T\|_1 < \|\eta_{T^c}\|_1, \]

where $T^c$ is the complementary index set of $T$ and $\eta_T$ is the restriction of $\eta$ to $T$.
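For a frame whose kernel is one-dimensional, the definition can be tested directly, since the inequality is invariant under scaling of the kernel vector. The sketch below relies on that assumption; the function name and the example frame are hypothetical.

```python
def nsp_holds_for_direction(eta, k):
    """Check ||eta_T||_1 < ||eta_{T^c}||_1 for every T with #T <= k.

    It suffices to test the k largest entries in magnitude, since that
    choice of T maximises the left side and minimises the right side.
    """
    mags = sorted((abs(v) for v in eta), reverse=True)
    return sum(mags[:k]) < sum(mags[k:])

# Hypothetical frame f_1 = (1, -1, 0), f_2 = (0, 1, -1) in R^3;
# its kernel is spanned by eta = (1, 1, 1).
eta = [1, 1, 1]
print(nsp_holds_for_direction(eta, 1))   # True:  1 < 2
print(nsp_holds_for_direction(eta, 2))   # False: 2 > 1
```

For kernels of higher dimension the condition has to hold for every nonzero kernel vector, which this single-direction check does not cover.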

A fundamental result in compressed sensing is that every $k$-sparse signal $x \in \mathbb{H}^d$ can be recovered via the $\ell_1$ minimization if and only if the sensing matrix $A$ has the null space property of order $k$. We state it as follows (see [9,13-15,19]):

Theorem 3.1. Let $\mathcal{F}$ be a set of vectors in $\mathbb{H}^d$ and $F$ be the associated frame matrix. Then $F$ satisfies the null space property of order $k$ if and only if it has

\[ \operatorname*{argmin}_{x \in \mathbb{H}^d} \{\|x\|_1 : F^T x = F^T x_0\} = x_0 \]

for every $x_0 \in \mathbb{H}^d_k$.

3.2.  The  null  space  property  for  the  real  sparse  phase  retrieval 

Our goal here is to extend Theorem 3.1 to phase retrieval for real signals. For a given frame $\mathcal{F} = \{f_1, \ldots, f_m\}$ and a subset $S$ of $[1 : m]$ we shall use $\mathcal{F}_S$ to denote the set $\mathcal{F}_S := \{f_j : j \in S\}$. Similarly, for the frame matrix we shall use $F_S$ to denote the corresponding frame matrix of $\mathcal{F}_S$, i.e. the matrix whose columns are the vectors of $\mathcal{F}_S$. We first consider the real case.

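Theorem 3.2. Let $\mathcal{F} = \{f_1, f_2, \ldots, f_m\}$ be a set of vectors in $\mathbb{R}^d$ and $F$ be the associated frame matrix. The following properties are equivalent:

(A) For any $x_0 \in \mathbb{R}^d_k$ we have

\[ \operatorname*{argmin}_{x \in \mathbb{R}^d} \{\|x\|_1 : |F^T x| = |F^T x_0|\} = \{\pm x_0\}, \tag{3.1} \]

where $|F^T x| = [|\langle f_1, x \rangle|, \ldots, |\langle f_m, x \rangle|]^T$.

(B) For every $S \subseteq [1 : m]$, it holds that

\[ \|u + v\|_1 < \|u - v\|_1 \]

for all nonzero $u \in \mathcal{N}(F_S)$ and $v \in \mathcal{N}(F_{S^c})$ satisfying $\|u + v\|_0 \le k$.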

Proof. First we show (B) ⇒ (A). Let $b = [b_1, \ldots, b_m]^T := |F^T x_0|$ where $x_0 \in \mathbb{R}^d_k$. For a fixed $\epsilon \in \{1, -1\}^m$ set $b_\epsilon := [\epsilon_1 b_1, \ldots, \epsilon_m b_m]^T$. We now consider the following minimization problem:

\[ \min \|x\|_1 \quad \text{s.t.} \quad F^T x = b_\epsilon. \tag{3.2} \]

The solution to (3.2) is denoted as $x_\epsilon$. We claim that for any $\epsilon \in \{1, -1\}^m$ we must have

\[ \|x_\epsilon\|_1 \ge \|x_0\|_1 \]

if $x_\epsilon$ exists (it may not exist), and the equality holds if and only if $x_\epsilon = \pm x_0$.

To prove the claim let $\epsilon^* \in \{1, -1\}^m$ be such that $b_{\epsilon^*} = F^T x_0$. Note that property (B) implies the classical null space property of order $k$. To see this, for any nonzero $\eta \in \mathcal{N}(F)$ and $T \subset [1 : d]$ with $\#T \le k$, set $u := \eta$ and $v := \eta_T - \eta_{T^c}$. Let $S = [1 : m]$. Then $u \in \mathcal{N}(F_S)$ and $v \in \mathcal{N}(F_{S^c})$. The hypothesis of (B) now implies

\[ 2\|\eta_T\|_1 = \|u + v\|_1 < \|u - v\|_1 = 2\|\eta_{T^c}\|_1. \]

Consequently we must have $x_{\epsilon^*} = x_0$ by Theorem 3.1. Now for any $\epsilon \in \{-1, 1\}^m$ with $\epsilon \ne \pm\epsilon^*$, if $x_\epsilon$ does not exist then we have nothing to prove. Assume it does exist. Set $S_* := \{j : \epsilon_j = \epsilon^*_j\}$. Then

\[ \langle f_j, x_\epsilon \rangle = \begin{cases} \langle f_j, x_0 \rangle & j \in S_*, \\ -\langle f_j, x_0 \rangle & j \in S_*^c. \end{cases} \]

Set $u := x_0 - x_\epsilon$ and $v := x_0 + x_\epsilon$. Clearly $u \in \mathcal{N}(F_{S_*})$ and $v \in \mathcal{N}(F_{S_*^c})$. Furthermore $u + v = 2x_0 \in \mathbb{R}^d_k$. By the hypothesis of (B) we must have

\[ 2\|x_0\|_1 = \|u + v\|_1 < \|u - v\|_1 = 2\|x_\epsilon\|_1. \]

This proves (A).

Next we prove (A) ⇒ (B). Assume (B) is false, namely, there exist nonzero $u \in \mathcal{N}(F_S)$ and $v \in \mathcal{N}(F_{S^c})$ such that $\|u + v\|_1 \ge \|u - v\|_1$ and $u + v \in \mathbb{R}^d_k$. Now set

\[ x_0 := u + v \in \mathbb{R}^d_k. \]

Clearly,

\[ |\langle f_j, x_0 \rangle| = |\langle f_j, u + v \rangle| = |\langle f_j, u - v \rangle|, \quad j = 1, \ldots, m, \]

since either $\langle f_j, u \rangle = 0$ or $\langle f_j, v \rangle = 0$. In other words, $|F^T x_0| = |F^T (u - v)|$. Note that $u - v \ne -x_0$, for otherwise we would have $u = 0$, a contradiction. It follows from the hypothesis of (A) that we must have

\[ \|x_0\|_1 = \|u + v\|_1 < \|u - v\|_1. \]


This  is  a  contradiction.  □ 
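The sign-pattern device in the proof of (B) ⇒ (A) can be illustrated on a toy problem: the phaseless constraint $|F^T x| = b$ splits into $2^m$ linear systems $F^T x = b_\epsilon$, one per sign pattern. The sketch below is a deliberately small, hypothetical instance with $m = d = 2$, so each system has a unique solution and no $\ell_1$ solver is needed; it does not verify property (B) itself.

```python
from fractions import Fraction
from itertools import product

def solve_2x2(rows, rhs):
    """Solve a 2x2 linear system by Cramer's rule (rows assumed independent)."""
    (a, b), (c, d) = rows
    det = a * d - b * c
    return [Fraction(rhs[0] * d - b * rhs[1], det),
            Fraction(a * rhs[1] - rhs[0] * c, det)]

frame = [[2, 1], [1, -1]]                # hypothetical f_1, f_2 in R^2
x0 = [1, 0]                              # 1-sparse signal
b = [abs(sum(fi * xi for fi, xi in zip(f, x0))) for f in frame]

candidates = {}
for eps in product([1, -1], repeat=len(frame)):
    b_eps = [e * bj for e, bj in zip(eps, b)]
    x_eps = solve_2x2(frame, b_eps)      # unique solution of F^T x = b_eps
    candidates[eps] = (x_eps, sum(abs(v) for v in x_eps))

best = min(norm for _, norm in candidates.values())
minimisers = sorted(tuple(x) for x, norm in candidates.values() if norm == best)
```

In this instance the sign patterns $\pm(1, 1)$ return $\pm x_0$ with $\ell_1$ norm 1, while the patterns $\pm(1, -1)$ return $\pm(1/3, 4/3)$ with norm $5/3$, so the minimisers are exactly $\{\pm x_0\}$.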

Remark. Theorem 3.2 extends results for the null space property of order $k$ in compressive sensing to phase retrieval. It would be very interesting to construct a matrix $A \in \mathbb{R}^{m \times d}$ with $m$ on the order of $k \log d$ satisfying (B) in Theorem 3.2.

3.3.  The  null  space  property  for  the  complex  sparse  phase  retrieval 

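We now consider the complex case $\mathbb{H} = \mathbb{C}$. Throughout this subsection, we say that $\mathcal{S} = \{S_1, \ldots, S_p\}$, $p \ge 2$, is a partition of $[1 : m]$ if

\[ S_j \subset [1 : m], \quad \bigcup_{j=1}^{p} S_j = [1 : m] \quad\text{and}\quad S_j \cap S_\ell = \emptyset \text{ for all } j \ne \ell. \]

For convenience, we set $\mathbb{S} := \{c \in \mathbb{C} : |c| = 1\}$ and

\[ \mathbb{S}^m := \{(c_1, \ldots, c_m) \in \mathbb{C}^m : |c_j| = 1,\ j \in [1 : m]\}. \]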


Then  we  have: 

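Theorem 3.3. Let $\mathcal{F} = \{f_1, f_2, \ldots, f_m\}$ be a set of vectors in $\mathbb{C}^d$ and $F$ be the associated frame matrix. The following properties are equivalent.

(A) For any $x_0 \in \mathbb{C}^d_k$ we have

\[ \operatorname*{argmin}_{x \in \mathbb{C}^d} \{\|x\|_1 : |F^T x| = |F^T x_0|\} = \tilde{x}_0, \tag{3.3} \]

where $|F^T x| = [|\langle f_1, x \rangle|, \ldots, |\langle f_m, x \rangle|]^T$ and $\tilde{x}_0$ denotes the equivalence class $\{c x_0 : c \in \mathbb{S}\}$ in $\mathbb{C}^d/\sim$ containing $x_0$.

(B) Suppose that $S_1, \ldots, S_p$ is any partition of $[1 : m]$ and that $c_1, \ldots, c_p \in \mathbb{S}$ are any $p$ pairwise distinct complex numbers. If $\eta_j \in \mathcal{N}(F_{S_j}) \setminus \{0\}$, $j \in [1 : p]$, satisfy

\[ \frac{\eta_1 - \eta_\ell}{c_1 - c_\ell} = \frac{\eta_1 - \eta_j}{c_1 - c_j} \in \mathbb{C}^d_k \setminus \{0\} \quad \text{for all } \ell, j \in [2 : p], \tag{3.4} \]

then

\[ \|\eta_j - \eta_\ell\|_1 < \|c_\ell \eta_j - c_j \eta_\ell\|_1 \]

for all $j, \ell \in [1 : p]$ with $j \ne \ell$.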

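Proof. We first show (B) ⇒ (A). Let $b = [b_1, \ldots, b_m]^T := |F^T x_0|$ where $x_0 \in \mathbb{C}^d_k$. For a fixed $\epsilon \in \mathbb{S}^m$ set $b_\epsilon := [\epsilon_1 b_1, \ldots, \epsilon_m b_m]^T$. We now consider the following minimization problem:

\[ \min \|x\|_1 \quad \text{s.t.} \quad F^T x = b_\epsilon. \tag{3.5} \]

The solution to (3.5) is denoted as $x_\epsilon$. We claim that for any $\epsilon \in \mathbb{S}^m$ we must have

\[ \|x_\epsilon\|_1 \ge \|x_0\|_1 \]

if $x_\epsilon$ exists (it may not exist), and the equality holds if and only if $\tilde{x}_\epsilon = \tilde{x}_0$.

To prove the claim let $\epsilon^* \in \mathbb{S}^m$ be such that $b_{\epsilon^*} = F^T x_0$. Note that property (B) implies the classical null space property of order $k$. To see this, take $S_1 = [1 : m]$, $S_2 = \emptyset$, $c_1 = 1$ and $c_2 = -1$. Then (3.4) reduces to requiring that $\eta_1 - \eta_2 \in \mathbb{C}^d_k$, i.e., $\eta_1 - \eta_2$ is $k$-sparse. Given $T \subset [1 : d]$ with $\#T \le k$ and $\eta_1 \in \mathcal{N}(F) = \mathcal{N}(F_{S_1})$, set

\[ \eta_2 := (\eta_1)_{T^c} - (\eta_1)_T \in \mathcal{N}(F_{S_2}) = \mathbb{C}^d. \]

Then $\eta_1 - \eta_2 \in \mathbb{C}^d_k$. Property (B) implies that

\[ 2\|(\eta_1)_T\|_1 = \|\eta_1 - \eta_2\|_1 < \|\eta_1 + \eta_2\|_1 = 2\|(\eta_1)_{T^c}\|_1, \]

which implies the classical null space property.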

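Consequently we must have $x_{\epsilon^*} = x_0$ by Theorem 3.1. Now we consider an arbitrary $\epsilon \in \mathbb{S}^m$, writing $\tilde{\epsilon}$ for the class $\{c\epsilon : c \in \mathbb{S}\}$. If $\tilde{\epsilon} = \tilde{\epsilon}^*$, then $\tilde{x}_\epsilon = \tilde{x}_0$. So we only consider the case where $\tilde{\epsilon} \neq \tilde{\epsilon}^*$. If $x_\epsilon$ does not exist then we have nothing to prove. Assume it does exist. Set $c'_j := \epsilon_j / \epsilon^*_j$ and $\eta'_j := c'_j x_{\epsilon^*} - x_\epsilon$ for $1 \leq j \leq m$. We can use $c'_j$ to define an equivalence relation on $[1:m]$, namely $j \sim \ell$ if $c'_j = c'_\ell$. This equivalence relation leads to a partition $\mathcal{S} = \{S_1, \ldots, S_p\}$ of $[1:m]$. Now we set $c_j := c'_\ell$ where $\ell \in S_j$. Clearly all $c_j$, $1 \leq j \leq p$, are distinct and unimodular.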

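Now set $\eta_j := c_j x_{\epsilon^*} - x_\epsilon$. Then we have

$$\eta_j \in \mathcal{N}(F_{S_j}^T) \setminus \{0\} \quad \text{for all } j \in [1:p]$$

and

$$\frac{\eta_1 - \eta_\ell}{c_1 - c_\ell} = \frac{\eta_1 - \eta_j}{c_1 - c_j} \in \mathbb{C}^d_k \setminus \{0\} \quad \text{for all } j, \ell \in [2:p].$$

By the hypothesis of (B) we must have

$$|c_j - c_\ell| \cdot \|x_0\|_1 = \|\eta_j - \eta_\ell\|_1 < \|c_\ell \eta_j - c_j \eta_\ell\|_1 = |c_j - c_\ell| \cdot \|x_\epsilon\|_1,$$

which implies that

$$\|x_0\|_1 < \|x_\epsilon\|_1.$$

This proves (A).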

We next prove (A) $\Rightarrow$ (B). Assume (B) is false, namely, there exist nonzero $\eta_j \in \mathcal{N}(F_{S_j}^T)$, $j \in [1:p]$, satisfying (3.4) but

$$\|\eta_{j_0} - \eta_{\ell_0}\|_1 \geq \|c_{\ell_0} \eta_{j_0} - c_{j_0} \eta_{\ell_0}\|_1$$

for some distinct $j_0, \ell_0 \in [1:p]$. Note that (3.4) implies that

$$\frac{\eta_j - \eta_\ell}{c_j - c_\ell} = \frac{\eta_m - \eta_n}{c_m - c_n} \in \mathbb{C}^d_k \setminus \{0\} \tag{3.6}$$

for all $j, \ell, m, n \in [1:p]$ with $j \neq \ell$ and $m \neq n$. Without loss of generality, we assume that $j_0 = 1$, $\ell_0 = 2$, i.e.,

$$\|\eta_1 - \eta_2\|_1 \geq \|c_2 \eta_1 - c_1 \eta_2\|_1. \tag{3.7}$$

Set

$$x_0 := \eta_1 - \eta_2,$$

and (3.6) implies that $x_0 \in \mathbb{C}^d_k \setminus \{0\}$. We claim that

$$|\langle f_j, x_0\rangle| = |\langle f_j, \eta_1 - \eta_2\rangle| = |\langle f_j, c_2 \eta_1 - c_1 \eta_2\rangle| \quad \text{for all } j \in [1:m]. \tag{3.8}$$

Note that $x_0$ is $k$-sparse. Combining (3.8), (3.7) and (3.3) now yields

$$c x_0 = c\eta_1 - c\eta_2 = c_2 \eta_1 - c_1 \eta_2$$

for some $c \in \mathbb{S}$. Consequently we obtain

$$(c - c_2)\eta_1 = (c - c_1)\eta_2,$$

which implies that

$$\eta_2 = \frac{c - c_2}{c - c_1}\,\eta_1. \tag{3.9}$$

Here, note that $c \notin \{c_1, c_2\}$, for otherwise we would have either $\eta_1 = 0$ or $\eta_2 = 0$. Combining (3.4) and (3.9) leads to

• $\eta_1$ is $k$-sparse;

• for all $j \in [2:p]$, $\eta_j$ and $\eta_1$ are linearly dependent and hence $\eta_1 \in \mathcal{N}(F_{S_j}^T)$.

Hence we have $F^T \eta_1 = 0$. By the hypothesis of (A) and $\eta_1 \in \mathbb{C}^d_k$ we have $\eta_1 = 0$, a contradiction.

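It remains to prove (3.8). First, when $j \in S_1 \cup S_2$, (3.8) holds, since either $\langle f_j, \eta_1\rangle = 0$ or $\langle f_j, \eta_2\rangle = 0$. We consider the case where $j \in S_3$. Set $y_0 := \frac{\eta_1 - \eta_2}{c_1 - c_2}$. Then (3.6) implies that

$$\frac{\eta_1 - \eta_3}{c_1 - c_3} = \frac{\eta_2 - \eta_3}{c_2 - c_3} = y_0$$

and hence

$$\eta_1 = (c_1 - c_3) y_0 + \eta_3, \qquad \eta_2 = (c_2 - c_3) y_0 + \eta_3.$$

Note that $\langle f_j, \eta_3\rangle = 0$ for $j \in S_3$. Then

$$|\langle f_j, c_2 \eta_1 - c_1 \eta_2\rangle| = |\langle f_j, c_2 (c_1 - c_3) y_0 - c_1 (c_2 - c_3) y_0\rangle| = |\langle f_j, c_3 (c_1 - c_2) y_0\rangle| = |\langle f_j, \eta_1 - \eta_2\rangle| = |\langle f_j, x_0\rangle|.$$

Using a similar argument, we easily prove the claim for $j \in S_4, \ldots, S_p$. □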

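Remark. When $p = 2$, Eq. (3.4) reduces to $\eta_1 - \eta_2 \in \mathbb{C}^d_k \setminus \{0\}$. Hence, if we take $p = 2$, property (B) in Theorem 3.3 implies that

$$\|\eta_1 - \eta_2\|_1 < \|c_2 \eta_1 - c_1 \eta_2\|_1$$

for all nonzero $\eta_1 \in \mathcal{N}(F_{S_1}^T)$ and $\eta_2 \in \mathcal{N}(F_{S_2}^T)$ satisfying $\eta_1 - \eta_2 \in \mathbb{C}^d_k \setminus \{0\}$ and all $c_1, c_2 \in \mathbb{S}$ with $c_1 \neq c_2$.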

Remark. In [21], Tillmann and Pfetsch investigated the computational complexity of the classical null space property, and d'Aspremont and El Ghaoui designed algorithms to test it [11]. It would be very interesting to extend these results to the null space property introduced in Theorem 3.3 in future research.

4. Null space property for general phase retrieval

Theorem 3.2 and Theorem 3.3 present the null space property for phase retrievability on $\mathbb{R}^d_k$ and $\mathbb{C}^d_k$, respectively. In phase retrieval, one is also interested in the condition under which $\mathcal{F}$ is phase retrievable on $\mathbb{R}^d$ or $\mathbb{C}^d$. For the real case, such a condition is presented in [1]:

Theorem 4.1. (See [1].) Let $\mathcal{F} = \{f_1, f_2, \ldots, f_m\}$ be a set of vectors in $\mathbb{R}^d$ and $F$ be the associated frame matrix. The following properties are equivalent:

(A) $\mathcal{F}$ is phase retrievable on $\mathbb{R}^d$;

(B) For every subset $S \subset \{1, \ldots, m\}$, either $\{f_j\}_{j \in S}$ spans $\mathbb{R}^d$ or $\{f_j\}_{j \in S^c}$ spans $\mathbb{R}^d$.
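
Condition (B) of Theorem 4.1 can be checked by brute force for small frames. The following is a minimal sketch, not part of the original paper: the function name `has_complement_property` and the two test frames are illustrative assumptions, and the loop over all subsets is exponential in $m$, so it is only practical for small examples.

```python
from itertools import combinations
import numpy as np

def has_complement_property(F):
    """Check condition (B) for a real frame matrix F of shape (d, m).

    The columns of F are the frame vectors f_1, ..., f_m (hypothetical layout).
    """
    d, m = F.shape

    def spans(cols):
        # A family spans R^d exactly when the corresponding columns have rank d.
        return len(cols) > 0 and np.linalg.matrix_rank(F[:, list(cols)]) == d

    indices = range(m)
    for size in range(m + 1):
        for S in combinations(indices, size):
            Sc = [j for j in indices if j not in S]
            if not spans(S) and not spans(Sc):
                return False  # neither S nor its complement spans R^d
    return True

# Three vectors in general position in R^2: any two of them span R^2.
good = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0]])
# Only two vectors in R^2: S = {f_1} fails, since neither side spans.
bad = np.array([[1.0, 0.0],
                [0.0, 1.0]])
print(has_complement_property(good), has_complement_property(bad))
```

The second frame illustrates why at least $2d - 1$ vectors are needed in the real case: with fewer, some subset and its complement both have fewer than $d$ elements.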

We next consider the complex case. Motivated by Theorem 3.3, we can present the null space property under which $\mathcal{F}$ is phase retrievable on $\mathbb{C}^d$. It can be considered as an extension of Theorem 4.1:

Theorem 4.2. Let $\mathcal{F} = \{f_1, f_2, \ldots, f_m\}$ be a set of vectors in $\mathbb{C}^d$ and $F$ be the associated frame matrix. The following properties are equivalent:

(A) $\mathcal{F}$ is phase retrievable on $\mathbb{C}^d$;

(B) Suppose that $S_1, \ldots, S_p$ is any partition of $[1:m]$ and that $c_1, \ldots, c_p \in \mathbb{S}$ are any $p$ pairwise distinct complex numbers. There exist no $\eta_j \in \mathcal{N}(F_{S_j}^T) \setminus \{0\}$, $j = 1, \ldots, p$, such that

$$\frac{\eta_1 - \eta_\ell}{c_1 - c_\ell} = \frac{\eta_1 - \eta_j}{c_1 - c_j} \neq 0 \quad \text{for all } \ell, j \in [2:p]. \tag{4.1}$$

Proof. We first prove (A) $\Rightarrow$ (B). Assume (B) is false, namely, there exist nonzero $\eta_j \in \mathcal{N}(F_{S_j}^T)$, $j \in [1:p]$, satisfying (4.1). Set

$$x_0 := \eta_1 - \eta_2.$$

Using a method similar to the proof of (3.8), we obtain that

$$|\langle f_j, x_0\rangle| = |\langle f_j, \eta_1 - \eta_2\rangle| = |\langle f_j, c_2 \eta_1 - c_1 \eta_2\rangle| \quad \text{for all } j \in [1:m].$$

Then, according to (A) and the definition of phase retrievable, we have

$$c x_0 = c\eta_1 - c\eta_2 = c_2 \eta_1 - c_1 \eta_2$$

for some unimodular constant $c \in \mathbb{S} \setminus \{c_1, c_2\}$, which implies that

$$\eta_2 = \frac{c - c_2}{c - c_1}\,\eta_1. \tag{4.2}$$

Combining (4.1) and (4.2), we obtain that, for all $j \in [2:p]$, $\eta_j$ and $\eta_1$ are linearly dependent and hence $\eta_1 \in \mathcal{N}(F_{S_j}^T)$. So $F^T \eta_1 = 0$. Then (A) implies that $\eta_1 = 0$, a contradiction.

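We next show (B) $\Rightarrow$ (A). Set $b = [b_1, \ldots, b_m]^T := |F^T x_0|$ where $x_0 \in \mathbb{C}^d \setminus \{0\}$. For a fixed $\epsilon \in \mathbb{S}^m$ set $b_\epsilon := [\epsilon_1 b_1, \ldots, \epsilon_m b_m]^T$. We now consider the solution to

$$F^T x = b_\epsilon. \tag{4.3}$$

The solution to (4.3) is denoted as $x_\epsilon$. We claim that if $x_\epsilon$ exists then $\tilde{x}_\epsilon = \tilde{x}_0$, which implies (A). Recall that $\tilde{x}_0$ denotes the equivalence class $\{c x_0 : c \in \mathbb{S}\}$ in $\mathbb{C}^d/\!\sim$ containing $x_0$; we write $\tilde{\epsilon}$ for the class $\{c\epsilon : c \in \mathbb{S}\}$ in the same way. To prove the claim let $\epsilon^* \in \mathbb{S}^m$ be such that $b_{\epsilon^*} = F^T x_0$. Then (B) implies that the rank of $F$ is $d$. Consequently we must have $x_{\epsilon^*} = x_0$. Now we consider an arbitrary $\epsilon \in \mathbb{S}^m$. If $\tilde{\epsilon} = \tilde{\epsilon}^*$, then $\tilde{x}_\epsilon = \tilde{x}_0$. It therefore suffices to prove that $x_\epsilon$ does not exist if $\tilde{\epsilon} \neq \tilde{\epsilon}^*$. Assume $x_\epsilon$ does exist. Set $c'_j := \epsilon_j / \epsilon^*_j$ and $\eta'_j := c'_j x_{\epsilon^*} - x_\epsilon$ for $1 \leq j \leq m$. We can use $c'_j$ to define an equivalence relation on $[1:m]$, namely $j \sim \ell$ if $c'_j = c'_\ell$. This equivalence relation leads to a partition $\mathcal{S} = \{S_1, \ldots, S_p\}$ of $[1:m]$. Now we set $c_j := c'_\ell$ where $\ell \in S_j$. Clearly all $c_j$, $1 \leq j \leq p$, are distinct and unimodular. Now set $\eta_j := c_j x_{\epsilon^*} - x_\epsilon$. By definition, for all $1 \leq j \leq p$ we have

$$\eta_j \in \mathcal{N}(F_{S_j}^T) \setminus \{0\}$$

and

$$\frac{\eta_1 - \eta_j}{c_1 - c_j} = \frac{\eta_1 - \eta_\ell}{c_1 - c_\ell} \neq 0 \quad \text{for all } j, \ell \in [2:p],$$

which contradicts (B). Hence $x_\epsilon$ does not exist if $\tilde{\epsilon} \neq \tilde{\epsilon}^*$. This proves (A). □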


Remark. When $p = 2$, Eq. (4.1) reduces to $\eta_1 - \eta_2 \neq 0$, which in turn implies that either $\mathcal{N}(F_{S_1}^T) = \{0\}$ or $\mathcal{N}(F_{S_2}^T) = \{0\}$.

linear process, but the phases of these measurements are often unreliable or not available. To reconstruct the signal, one must

case of real signals, we devise a theory of “almost” injective intensity measurements, and we characterize such ensembles. Later, we show that phase retrieval from $M + 1$ almost injective intensity measurements is NP-hard, indicating that computationally efficient phase retrieval must come at the price of measurement redundancy.

1. Introduction

Given an ensemble $\Phi = \{\varphi_n\}_{n=1}^{N}$ of $M$-dimensional vectors (real or complex), the phase retrieval problem is to recover a signal $x$ from the intensity measurements $\mathcal{A}(x) := \{|\langle x, \varphi_n\rangle|^2\}_{n=1}^{N}$. Note that for any scalar $\omega$ of unit modulus, $\mathcal{A}(\omega x) = \mathcal{A}(x)$, and so the best one can hope to do is recover the set of signals $\{\omega x : |\omega| = 1\}$. Intensity measurements arise in a number of applications in which phase is either unreliable or not available, such as diffractive imaging [10,30,34,35] and optics [21,35,41]. For example, in high-power coherent diffractive imaging, only the intensities of diffracted X-rays can be recorded, and so to reconstruct material density profiles one must obtain the lost phase information after the fact [10]. Intensity measurements also appear in quantum state tomography when measuring a rank-1 quantum state using a positive operator-valued measure (POVM) consisting of rank-1 elements [27,28,31]. In most of these applications it is desirable to perform phase retrieval from as few measurements as possible, since increasing $N$ invariably makes the measurement process more expensive or time consuming.
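
The global phase ambiguity described above is easy to observe numerically. The sketch below is illustrative only: the random ensemble `Phi`, the dimensions and the helper name `intensity_measurements` are assumptions made for the example, not objects from the paper.

```python
import numpy as np

def intensity_measurements(x, Phi):
    """Return |<x, phi_n>|^2 for each column phi_n of Phi (shape M x N)."""
    # <x, phi_n> = sum_m x[m] * conj(phi_n[m]), linear in the first argument.
    return np.abs(Phi.conj().T @ x) ** 2

rng = np.random.default_rng(0)
M, N = 4, 12
Phi = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
omega = np.exp(1j * 0.7)  # any scalar of unit modulus

# Multiplying x by omega leaves every intensity measurement unchanged.
same = np.allclose(intensity_measurements(x, Phi),
                   intensity_measurements(omega * x, Phi))
print(same)
```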

Recently, there has been a lot of work on algorithmic phase retrieval. For example, by viewing intensity measurements as Hilbert-Schmidt inner products between rank-1 operators [3,14], phase retrieval can be formulated as a low-rank matrix recovery problem [12,18,23,39], and with this formulation phase retrieval is possible from $N = O(M)$ intensity measurements [13]. Another approach is to exploit the polarization identity along with expander graphs to design a measurement ensemble and apply spectral methods to perform phase retrieval [1,6]. One can also formulate phase retrieval in terms of MaxCut, and solvers for this formulation are equivalent to a popular solver (PhaseLift) for the matrix recovery formulation [38,40]. While this recent work has focused on stable and efficient phase retrieval from asymptotically few measurements (namely, $N = O(M)$), the present paper focuses on injectivity and algorithmic efficiency with the absolute minimum number of measurements.

In the next section, we construct an ensemble of $N = 4M - 4$ measurement vectors in $\mathbb{C}^M$ which yield injective intensity measurements. This is the second known injective ensemble of this size (the first is due to Bodmann and Hammen [9]), and it is conjectured to be the smallest-possible injective ensemble [5]. The same conjecture suggests that $4M - 4$ generic measurement vectors yield injectivity (that is, there exists a measure-zero set of ensembles of $4M - 4$ vectors such that every ensemble of $4M - 4$ vectors outside of this set yields injectivity). The following summarizes what is currently known about the so-called “$4M - 4$ conjecture”:

•  The conjecture holds for $M = 2$ and $M = 2^m + 1$, $m = 1, 2, 3, \ldots$ [19] (cf. [5]).

•  If $N < 4M - 2\alpha(M - 1) - 3$, then $\mathcal{A}$ is not injective [31]; here, $\alpha(M - 1) \leq \log_2 M$ denotes the number of 1's in the binary expansion of $M - 1$.

•  For each $M \geq 2$, there exists an ensemble $\Phi$ of $N = 4M - 4$ measurement vectors such that $\mathcal{A}$ is injective [9] (see also Section 2 of this paper).

•  If $N \geq 4M - 4$, then $\mathcal{A}$ is injective for generic $\Phi$ [19] (cf. [4]).

Bodmann and Hammen [9] leverage the Dirichlet kernel and the Cayley map to prove
injectivity  of  their  ensemble,  but  it  is  unclear  whether  phase  retrieval  is  algorithmically 
feasible  from  their  ensemble.  By  contrast,  for  the  ensemble  in  this  paper,  we  use  basic 
ideas  from  harmonic  analysis  over  cyclic  groups  to  devise  a  corresponding  phase  retrieval 
algorithm,  and  we  demonstrate  injectivity  in  Theorem  6  by  proving  that  the  algorithm 
recovers  any  noiseless  signal  up  to  global  phase. 

In Section 3, we devise a theory of ensembles for which the corresponding intensity measurements are “almost” injective, that is, $\mathcal{A}^{-1}(\mathcal{A}(x)) = \{\omega x : |\omega| = 1\}$ for almost every $x$. In this section, we focus on the real case, meaning phase retrieval is up to a global sign factor $\omega = \pm 1$, and our approach is inspired by the characterization of injectivity in the real case by Balan, Casazza and Edidin [4]. After characterizing almost injectivity in the real case, we find a particularly satisfying sufficient condition for almost injectivity: that $\Phi$ forms a unit norm tight frame with $M$ and $N$ relatively prime. Characterizing almost injectivity in the complex case remains an open problem.

We conclude with Section 4, in which we consider algorithmic phase retrieval in the real case from $N = M + 1$ almost injective intensity measurements. Specifically, we show that phase retrieval in this case is NP-hard by reduction from the subset sum problem. The hardness of phase retrieval in this minimal case suggests a new problem for phase retrieval: What is the smallest $C$ for which there exists a family of ensembles of size $N = CM + o(M)$ such that phase retrieval can be performed in polynomial time?

2. $4M - 4$ injective intensity measurements

In this section, we provide an ensemble of $4M - 4$ measurement vectors which yield injective intensity measurements for $\mathbb{C}^M$. The vectors in our ensemble are modulated discrete cosine functions, and they are explicitly constructed at the end of this section. We start here by motivating our construction, specifically by identifying the significance of circular autocorrelation, which we define in (1) below.

Consider the $P$-dimensional complex vector space $\ell(\mathbb{Z}_P) := \{u : \mathbb{Z} \to \mathbb{C} : u[p + P] = u[p],\ \forall p \in \mathbb{Z}\}$. The discrete Fourier basis in $\ell(\mathbb{Z}_P)$ is the sequence of $P$ vectors $\{f_q\}_{q \in \mathbb{Z}_P}$ defined by $f_q[p] := e^{2\pi i p q / P}$ (the notation “$q \in \mathbb{Z}_P$” is taken to mean a set of coset representatives of $\mathbb{Z}$ with respect to the subgroup $P\mathbb{Z}$). The discrete Fourier transform (DFT) on $\mathbb{Z}_P$ is $F^* : \ell(\mathbb{Z}_P) \to \ell(\mathbb{Z}_P)$, with corresponding inverse DFT $(F^*)^{-1} = \frac{1}{P} F$, defined by

$$(F^* u)[q] = \langle u, f_q\rangle = \sum_{p \in \mathbb{Z}_P} u[p]\, e^{-2\pi i p q / P},$$

$$(F v)[p] = \sum_{q \in \mathbb{Z}_P} v[q]\, f_q[p] = \sum_{q \in \mathbb{Z}_P} v[q]\, e^{2\pi i p q / P}.$$

Now let $T^p : \ell(\mathbb{Z}_P) \to \ell(\mathbb{Z}_P)$ be the translation operator $(T^p u)[p'] := u[p' - p]$. The circular autocorrelation of $u$ is then $\operatorname{CirAut}(u) \in \ell(\mathbb{Z}_P)$, defined entrywise by

$$\operatorname{CirAut}(u)[p] := \langle u, T^p u\rangle = \sum_{p' \in \mathbb{Z}_P} u[p']\, \overline{u[p' - p]}. \tag{1}$$

Consider the DFT of a circular autocorrelation:

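$$\begin{aligned}
(F^* \operatorname{CirAut}(u))[q] &= \sum_{p \in \mathbb{Z}_P} \sum_{p' \in \mathbb{Z}_P} u[p']\, \overline{u[p' - p]}\, e^{-2\pi i p q / P} \\
&= \sum_{p' \in \mathbb{Z}_P} u[p']\, e^{-2\pi i p' q / P}\, \overline{\sum_{p \in \mathbb{Z}_P} u[p' - p]\, e^{-2\pi i (p' - p) q / P}} \\
&= \sum_{p' \in \mathbb{Z}_P} u[p']\, e^{-2\pi i p' q / P}\, \overline{\Big(\sum_{p'' \in \mathbb{Z}_P} u[p'']\, e^{-2\pi i p'' q / P}\Big)} = |\langle u, f_q\rangle|^2.
\end{aligned} \tag{2}$$

As such, if one has the intensity measurements $\{|\langle u, f_q\rangle|^2\}_{q \in \mathbb{Z}_P}$, then one may compute the circular autocorrelation $\operatorname{CirAut}(u)$ by applying the inverse DFT. In order to perform phase retrieval from $\{|\langle u, f_q\rangle|^2\}_{q \in \mathbb{Z}_P}$, it therefore suffices to determine $u$ from $\operatorname{CirAut}(u)$. This is the motivation for our approach in this section.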
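
Identity (2) can be checked numerically. The following sketch assumes NumPy's FFT conventions, under which `np.fft.fft` computes $\sum_p u[p] e^{-2\pi i pq/P}$ (matching $F^*$ above) and `np.fft.ifft` includes the factor $1/P$; the helper name `cir_aut` is ours, not the paper's.

```python
import numpy as np

def cir_aut(u):
    """CirAut(u)[p] = <u, T^p u> = sum_{p'} u[p'] * conj(u[p' - p]), indices mod P."""
    P = len(u)
    # np.roll(u, p)[p'] equals u[p' - p], i.e. the translate T^p u;
    # np.vdot conjugates its first argument.
    return np.array([np.vdot(np.roll(u, p), u) for p in range(P)])

rng = np.random.default_rng(1)
P = 9
u = rng.standard_normal(P) + 1j * rng.standard_normal(P)

intensities = np.abs(np.fft.fft(u)) ** 2   # |<u, f_q>|^2 for q = 0, ..., P-1
lhs = np.fft.fft(cir_aut(u))               # (F^* CirAut(u))[q]
identity_holds = np.allclose(lhs, intensities)

# Conversely, the inverse DFT of the intensities returns the autocorrelation.
roundtrip_holds = np.allclose(np.fft.ifft(intensities), cir_aut(u))
print(identity_holds, roundtrip_holds)
```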

To see how to “invert” CirAut, let’s consider an example. Take $x = (a, b, c) \in \mathbb{C}^3$ and consider the circular autocorrelation of $x$ as a signal in $\ell(\mathbb{Z}_3)$:

$$\operatorname{CirAut}(x) = \big(|a|^2 + |b|^2 + |c|^2,\ a\bar{c} + b\bar{a} + c\bar{b},\ a\bar{b} + b\bar{c} + c\bar{a}\big).$$

Notice that every entry of $\operatorname{CirAut}(x)$ is a nonlinear combination of the entries of $x$, from which it is unclear how to compute the entries of $x$. To simplify the structure, we pad $x$ with zeros and enforce even symmetry; then the circular autocorrelation of $u := (2a, b, c, 0, 0, 0, 0, c, b) \in \ell(\mathbb{Z}_9)$ is

$$\begin{aligned}
\operatorname{CirAut}(u) = \big(&4|a|^2 + 2|b|^2 + 2|c|^2,\ 2\operatorname{Re}(2a\bar{b} + b\bar{c}),\ |b|^2 + 4\operatorname{Re}(a\bar{c}),\ 2\operatorname{Re}(b\bar{c}),\ |c|^2, \\
&|c|^2,\ 2\operatorname{Re}(b\bar{c}),\ |b|^2 + 4\operatorname{Re}(a\bar{c}),\ 2\operatorname{Re}(2a\bar{b} + b\bar{c})\big).
\end{aligned} \tag{3}$$

Although it still appears rather complicated, this circular autocorrelation actually lends itself well to recovering the entries of $x$.

Before explaining this further, first note that $9 = 4(3) - 3$, and we can generalize our mapping $x \mapsto u$ by sending vectors in $\mathbb{C}^M$ to members of $\ell(\mathbb{Z}_{4M-3})$. To make this clear, consider the reversal operator $R : \ell(\mathbb{Z}_P) \to \ell(\mathbb{Z}_P)$ defined by $(Ru)[p] = u[-p]$. Then given a vector $x \in \mathbb{C}^M$, padding with zeros and enforcing even symmetry is equivalent to embedding $x$ in $\ell(\mathbb{Z}_{4M-3})$ by appending $3M - 3$ zeros to $x$ and then taking $u = x + Rx \in \ell(\mathbb{Z}_{4M-3})$. (From this point forward we use $x$ to represent both the original signal in $\mathbb{C}^M$ and the version of $x$ embedded in $\ell(\mathbb{Z}_{4M-3})$ via zero-padding; the distinction will be clear from context.) Computing $x \in \mathbb{C}^M$ then reduces to determining the first $M$ entries of $x \in \ell(\mathbb{Z}_{4M-3})$ from $\operatorname{CirAut}(x + Rx)$. If $x$ is completely real-valued, then this is indeed possible. For instance, consider the circular autocorrelation (3). If the entries of $x$ are all real, then this becomes

$$\operatorname{CirAut}(x + Rx) = \big(4a^2 + 2b^2 + 2c^2,\ 4ab + 2bc,\ b^2 + 4ac,\ 2bc,\ c^2,\ c^2,\ 2bc,\ b^2 + 4ac,\ 4ab + 2bc\big).$$

Since $\operatorname{CirAut}(x + Rx)[4] = c^2$, we simply take a square root to obtain $c$ up to a sign. Assuming $c$ is nonzero, we then divide $\operatorname{CirAut}(x + Rx)[3]$ by $2c$ to determine $b$ up to the same sign. Then subtracting $b^2$ from $\operatorname{CirAut}(x + Rx)[2]$ and dividing by $4c$ gives $a$ up to the same sign.
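
The three steps above can be sketched in a few lines. This is an illustration for $M = 3$ only, assuming a real signal with $c \neq 0$; the function names `cir_aut` and `recover_real3` and the sample values are ours.

```python
import numpy as np

def cir_aut(u):
    """Circular autocorrelation: entry p is sum_{p'} u[p'] * conj(u[p' - p])."""
    P = len(u)
    return np.array([np.vdot(np.roll(u, p), u) for p in range(P)])

def recover_real3(x):
    """Recover real x = (a, b, c), c != 0, up to sign from CirAut(x + Rx)."""
    a, b, c = x
    u = np.array([2 * a, b, c, 0, 0, 0, 0, c, b], dtype=float)  # x + Rx in l(Z_9)
    acf = cir_aut(u).real
    c_hat = np.sqrt(acf[4])                      # entry 4 equals c^2
    b_hat = acf[3] / (2 * c_hat)                 # entry 3 equals 2bc
    a_hat = (acf[2] - b_hat ** 2) / (4 * c_hat)  # entry 2 equals b^2 + 4ac
    return np.array([a_hat, b_hat, c_hat])

x = np.array([1.5, -2.0, 0.5])
x_hat = recover_real3(x)
# The estimate agrees with x up to a global sign (the sign chosen for c_hat).
up_to_sign = np.allclose(x_hat, x) or np.allclose(x_hat, -x)
print(up_to_sign)
```

Only entries 2, 3 and 4 of the autocorrelation are used here; by even symmetry the entries 5 through 8 carry no additional information.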

From this example, we see that the process of recovering the entries of $x$ from $\operatorname{CirAut}(x + Rx)$ is iterative, working backward through its first $2M - 2$ entries. But what happens if $c$ is zero? Fortunately, our process doesn’t break: In this case, we have

$$\operatorname{CirAut}(x + Rx) = \big(4a^2 + 2b^2,\ 4ab,\ b^2,\ 0,\ 0,\ 0,\ 0,\ b^2,\ 4ab\big).$$

Thus, we need only start with $\operatorname{CirAut}(x + Rx)[2]$ to determine the remaining entries of $x$ up to a sign. This observation brings to light the important role of the last nonzero entry of $x$ in our iteration. The relationship between this coordinate and the entries of $\operatorname{CirAut}(x + Rx)$ will be made rigorous later.

The above example illustrated how a real signal $x$ is determined by $\operatorname{CirAut}(x + Rx)$. A complex-valued signal, on the other hand, is not completely determined from $\operatorname{CirAut}(x + Rx)$. Luckily, this can be fixed by introducing a second vector in $\ell(\mathbb{Z}_{4M-3})$ obtained from $x$, and we will demonstrate this later, but for now we focus on $x + Rx$. To this end, let’s first take a closer look at the entries of $\operatorname{CirAut}(x + Rx)$. Since this circular autocorrelation has even symmetry by construction, we need only consider all entries of $\operatorname{CirAut}(x + Rx)$ up to index $2M - 2$. This leads to the following lemma:

Lemma 1. Let $x$ denote an $M$-dimensional complex signal embedded in $\ell(\mathbb{Z}_{4M-3})$ such that $x[p] = 0$ for all $p = M, \ldots, 4M - 4$. Then $\operatorname{CirAut}(x + Rx)[p] = 2\operatorname{Re}\langle x, T^p x\rangle + \langle x, RT^{-p} x\rangle$ for all $p = 1, \ldots, 2M - 2$.

Proof. First note that by the definition of the circular autocorrelation in (1) we have

$$\operatorname{CirAut}(x + Rx)[p] = \langle x + Rx, T^p(x + Rx)\rangle = 2\operatorname{Re}\langle x, T^p x\rangle + \langle x, RT^{-p} x\rangle + \langle x, RT^p x\rangle.$$

Thus, to complete the proof it suffices to show that $\langle x, RT^p x\rangle = 0$ for all $p = 1, \ldots, 2M - 2$. Since $x$ is only nonzero in its first $M$ entries, we have

$$\langle x, RT^p x\rangle = \sum_{p'=0}^{M-1} x[p']\, \overline{(RT^p x)[p']} = \sum_{p'=0}^{M-1} x[p']\, \overline{(T^p x)[-p']} = \sum_{p'=0}^{M-1} x[p']\, \overline{x[-p' - p]},$$

where the summand is zero whenever $-p' - p \notin [0, M - 1]$ modulo $4M - 3$. This is equivalent to having $-p$ not lie in the Minkowski sum $p' + [0, M - 1]$, and since $p' \in [0, M - 1]$ we see that $\langle x, RT^p x\rangle = 0$ for all $p = 1, \ldots, 2M - 2$. □

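As a consequence of Lemma 1, the following theorem expresses the entries of $\operatorname{CirAut}(x + Rx)$ in terms of the entries of $x$:

Theorem 2. Let $x$ denote an $M$-dimensional complex signal embedded in $\ell(\mathbb{Z}_{4M-3})$ such that $x[p] = 0$ for all $p = M, \ldots, 4M - 4$. Then we have

$$\operatorname{CirAut}(x + Rx)[p] = \begin{cases}
2\operatorname{Re}\Big(\displaystyle\sum_{p' = \frac{p+1}{2}}^{M-1} x[p']\big(\overline{x[p' - p]} + \overline{x[p - p']}\big)\Big) & \text{if } p \text{ is odd}, \\[2ex]
2\operatorname{Re}\Big(\displaystyle\sum_{p' = \frac{p}{2}+1}^{M-1} x[p']\big(\overline{x[p' - p]} + \overline{x[p - p']}\big)\Big) + \big|x[\tfrac{p}{2}]\big|^2 & \text{if } p \text{ is even},
\end{cases} \tag{4}$$

for all $p = 1, \ldots, 2M - 2$.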
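
Formula (4) can be compared against a direct evaluation of definition (1) for a random signal. This is a numerical sanity check under our own conventions, not part of the paper: the helper names `cir_aut`, `embed` and `formula4` are assumptions, and entries of $x$ with index outside $0, \ldots, M-1$ are treated as zero, as in the embedding.

```python
import numpy as np

def cir_aut(u):
    P = len(u)
    return np.array([np.vdot(np.roll(u, p), u) for p in range(P)])

def embed(x):
    """Zero-pad x in C^M to l(Z_{4M-3}); return the padded x and x + Rx."""
    M = len(x)
    xe = np.zeros(4 * M - 3, dtype=complex)
    xe[:M] = x
    Rx = np.roll(xe[::-1], 1)   # (Rx)[p] = x[-p mod P]
    return xe, xe + Rx

def formula4(x, p):
    """Right-hand side of (4) for 1 <= p <= 2M - 2."""
    M = len(x)

    def val(k):                 # x[k], taken as zero outside 0..M-1
        return x[k] if 0 <= k < M else 0.0

    start = (p + 1) // 2 if p % 2 else p // 2 + 1
    s = sum(x[pp] * (np.conj(val(pp - p)) + np.conj(val(p - pp)))
            for pp in range(start, M))
    extra = 0.0 if p % 2 else abs(val(p // 2)) ** 2
    return 2 * np.real(s) + extra

rng = np.random.default_rng(2)
M = 5
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
_, u = embed(x)
acf = cir_aut(u)
matches = all(np.isclose(acf[p], formula4(x, p)) for p in range(1, 2 * M - 1))
print(matches)
```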


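Proof. We first use Lemma 1 to get

$$\begin{aligned}
\operatorname{CirAut}(x + Rx)[p] &= 2\operatorname{Re}\langle x, T^p x\rangle + \langle x, RT^{-p} x\rangle \\
&= 2\operatorname{Re}\Big(\sum_{p'=0}^{M-1} x[p']\, \overline{x[p' - p]}\Big) + \sum_{p'=0}^{M-1} x[p']\, \overline{x[p - p']} \\
&= 2\operatorname{Re}\Big(\sum_{p'=p}^{M-1} x[p']\, \overline{x[p' - p]}\Big) + \sum_{p'=\max\{p-(M-1),\,0\}}^{\min\{p,\,M-1\}} x[p']\, \overline{x[p - p']},
\end{aligned} \tag{5}$$

where the last equality takes into account that the first summand is nonzero only when $p' - p \in [0, M - 1]$ and the second summand is nonzero only when $p - p' \in [0, M - 1]$, i.e., when $p' \in [p, p + (M - 1)]$ and $p' \in [p - (M - 1), p]$, respectively. To continue, we divide our analysis into cases.

For $p = 1, \ldots, M - 1$, (5) gives

$$\operatorname{CirAut}(x + Rx)[p] = 2\operatorname{Re}\Big(\sum_{p'=p}^{M-1} x[p']\, \overline{x[p' - p]}\Big) + \sum_{p'=0}^{p} x[p']\, \overline{x[p - p']}. \tag{6}$$

If $p$ is odd we can then write

$$\begin{aligned}
\sum_{p'=0}^{p} x[p']\, \overline{x[p - p']} &= \sum_{p'=0}^{\frac{p-1}{2}} x[p']\, \overline{x[p - p']} + \sum_{p'=\frac{p+1}{2}}^{p} x[p']\, \overline{x[p - p']} \\
&= \sum_{p''=\frac{p+1}{2}}^{p} x[p - p'']\, \overline{x[p'']} + \sum_{p'=\frac{p+1}{2}}^{p} x[p']\, \overline{x[p - p']} \\
&= 2\operatorname{Re}\Big(\sum_{p'=\frac{p+1}{2}}^{p} x[p']\, \overline{x[p - p']}\Big),
\end{aligned} \tag{7}$$

while if $p$ is even we similarly write

$$\sum_{p'=0}^{p} x[p']\, \overline{x[p - p']} = 2\operatorname{Re}\Big(\sum_{p'=\frac{p}{2}+1}^{p} x[p']\, \overline{x[p - p']}\Big) + \big|x[\tfrac{p}{2}]\big|^2. \tag{8}$$

Substituting (7) and (8) into (6) then gives (4).

For the remaining case, $p = M, \ldots, 2M - 2$ and (5) gives

$$\operatorname{CirAut}(x + Rx)[p] = \sum_{p'=p-(M-1)}^{M-1} x[p']\, \overline{x[p - p']}. \tag{9}$$

Similar to the previous case, taking $p$ to be odd yields

$$\sum_{p'=p-(M-1)}^{M-1} x[p']\, \overline{x[p - p']} = 2\operatorname{Re}\Big(\sum_{p'=\frac{p+1}{2}}^{M-1} x[p']\, \overline{x[p - p']}\Big), \tag{10}$$

while taking $p$ to be even yields

$$\sum_{p'=p-(M-1)}^{M-1} x[p']\, \overline{x[p - p']} = 2\operatorname{Re}\Big(\sum_{p'=\frac{p}{2}+1}^{M-1} x[p']\, \overline{x[p - p']}\Big) + \big|x[\tfrac{p}{2}]\big|^2, \tag{11}$$

and substituting (10) and (11) into (9) also gives (4). □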

Notice (4) shows that each member of $\{\operatorname{CirAut}(x + Rx)[p]\}_{p=1}^{2M-2}$ can be written as a combination of the first $M$ entries of $x$, but only those at or beyond the $\lceil p/2 \rceil$th index. As such, the index of the last nonzero entry of $x$ is closely related to that of the last nonzero entry of $\{\operatorname{CirAut}(x + Rx)[p]\}_{p=0}^{2M-2}$. This corresponds to our observation earlier in the case of $x \in \mathbb{R}^3$ where the third coordinate was assumed to be zero. We identify the relationship between the locations of these nonzero entries in the following lemma:


Lemma 3. Let $x$ denote an $M$-dimensional complex signal embedded in $\ell(\mathbb{Z}_{4M-3})$ such that $x[p] = 0$ for all $p = M, \ldots, 4M - 4$. Then the last nonzero entry of $\{\operatorname{CirAut}(x + Rx)[p]\}_{p=0}^{2M-2}$ has index $p = 2q$, where $q$ is the index of the last nonzero entry of $x$.

Proof. If $q \geq 1$, then (4) gives that $\operatorname{CirAut}(x + Rx)[2q] = |x[q]|^2 \neq 0$. Note that since $x[p'] = 0$ for every $p' > q$, (4) also gives that $\operatorname{CirAut}(x + Rx)[p] = 0$ for every $p > 2q$. For the remaining case where $q = 0$, (4) immediately gives that $\operatorname{CirAut}(x + Rx)[p] = 0$ for every $p \geq 1$. To show that $\operatorname{CirAut}(x + Rx)[0] \neq 0$ in this case, we apply the definition of circular autocorrelation (1):

$$\operatorname{CirAut}(x + Rx)[0] = \langle x + Rx, x + Rx\rangle = \|x + Rx\|^2 = |2x[0]|^2 \neq 0,$$

where the last equality uses the fact that $x$ is only supported at 0 since $q = 0$. □

As previously mentioned, we are unable to recover the entries of a complex signal $x$ solely from $\operatorname{CirAut}(x + Rx)$. One way to address this is to rotate the entries of $x$ in the complex plane and also take the circular autocorrelation of this modified signal. If we rotate by an angle which is not an integer multiple of $\pi$, this will produce new entries which are linearly independent from the corresponding entries of $x$ when viewed as vectors in the complex plane. As we will see, the problem of recovering the entries of $x$ then reduces to solving a linear system.

Take any $(4M - 3) \times (4M - 3)$ diagonal modulation operator $E$ whose diagonal entries $\{\omega_k\}_{k=0}^{4M-4}$ are unit modulus satisfying $\omega_j \overline{\omega_k} \notin \mathbb{R}$ for all $j \neq k$ and consider the new vector $Ex \in \ell(\mathbb{Z}_{4M-3})$. Then Theorem 2 gives


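$$\operatorname{CirAut}(Ex + REx)[p] = \begin{cases}
2\operatorname{Re}\Big(\displaystyle\sum_{p' = \frac{p+1}{2}}^{M-1} \omega_{p'} x[p']\big(\overline{\omega_{p'-p}\, x[p' - p]} + \overline{\omega_{p-p'}\, x[p - p']}\big)\Big) & \text{if } p \text{ is odd}, \\[2ex]
2\operatorname{Re}\Big(\displaystyle\sum_{p' = \frac{p}{2}+1}^{M-1} \omega_{p'} x[p']\big(\overline{\omega_{p'-p}\, x[p' - p]} + \overline{\omega_{p-p'}\, x[p - p']}\big)\Big) + \big|x[\tfrac{p}{2}]\big|^2 & \text{if } p \text{ is even},
\end{cases} \tag{12}$$

for all $p = 1, \ldots, 2M - 2$. We will see that (4) and (12) together allow us to solve for the entries of $x$ (up to a global phase factor) by working iteratively backward through the entries of $\operatorname{CirAut}(x + Rx)$ and $\operatorname{CirAut}(Ex + REx)$. As alluded to earlier, each entry index forms a linear system which can be solved using the following lemma: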


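Lemma 4. Let $a, b \in \mathbb{C} \setminus \{0\}$ and $\omega \in \mathbb{C} \setminus \mathbb{R}$ with $|\omega| = 1$. Then

$$b = \frac{i}{\overline{a}\,\operatorname{Im}(\omega)}\big(\operatorname{Re}(\omega a \overline{b}) - \omega \operatorname{Re}(a \overline{b})\big). \tag{13}$$

Proof. After some manipulation, we have

$$\begin{aligned}
\operatorname{Re}(\omega a \overline{b}) - \omega \operatorname{Re}(a \overline{b}) &= \operatorname{Re}(\omega)\operatorname{Re}(a \overline{b}) - \operatorname{Im}(\omega)\operatorname{Im}(a \overline{b}) - \omega \operatorname{Re}(a \overline{b}) \\
&= -i \operatorname{Im}(\omega)\big(\operatorname{Re}(a \overline{b}) - i \operatorname{Im}(a \overline{b})\big) = -i\, \overline{a}\, b \operatorname{Im}(\omega).
\end{aligned}$$

Rearranging then yields the desired result. □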
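
Formula (13) is straightforward to test numerically. In the sketch below the function name `solve_b` and the sample values of $a$, $b$ and $\omega$ are arbitrary choices for illustration.

```python
import numpy as np

def solve_b(a, omega, re_ab, re_wab):
    """Recover b from Re(a conj(b)) and Re(omega a conj(b)) using (13).

    Requires a != 0, |omega| = 1 and omega not real (so Im(omega) != 0).
    """
    return 1j / (np.conj(a) * omega.imag) * (re_wab - omega * re_ab)

a = 0.8 - 0.3j
b = -1.1 + 2.0j
omega = np.exp(1j * 1.0)   # unit modulus, not real

b_hat = solve_b(a, omega,
                np.real(a * np.conj(b)),
                np.real(omega * a * np.conj(b)))
print(np.isclose(b_hat, b))
```

The two real numbers $\operatorname{Re}(a\bar{b})$ and $\operatorname{Re}(\omega a \bar{b})$ are the components of $a\bar{b}$ along two non-parallel directions in the complex plane, which is why they determine $b$ once $a$ is known.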


We now use this lemma to describe how to recover $x$ up to global phase. By Lemma 3, the last nonzero entry of $\{\operatorname{CirAut}(x + Rx)[p]\}_{p=0}^{2M-2}$ has index $p = 2q$, where $q$ indexes the last nonzero entry of $x$. As such, we know that $x[k] = 0$ for every $k > q$, and $x[q]$ can be estimated up to a phase factor ($\hat{x}[q] = e^{i\psi} x[q]$) by taking the square root of $\operatorname{CirAut}(x + Rx)[2q] = |x[q]|^2$ (we will verify this soon, but this corresponds to the examples we have seen so far). Next, if we know $\operatorname{Re}(x[q]\overline{x[k]})$ and $\operatorname{Re}(\omega_q \overline{\omega_k}\, x[q]\overline{x[k]})$ for some $k < q$, then we can use these to estimate $x[k]$:

$$\hat{x}[k] := \frac{i}{\overline{\hat{x}[q]}\,\operatorname{Im}(\omega_q \overline{\omega_k})}\Big(\operatorname{Re}\big(\omega_q \overline{\omega_k}\, x[q]\overline{x[k]}\big) - \omega_q \overline{\omega_k} \operatorname{Re}\big(x[q]\overline{x[k]}\big)\Big) = e^{i\psi} x[k], \tag{14}$$

where the last equality follows from substituting $a = x[q]$, $b = x[k]$ and $\omega = \omega_q \overline{\omega_k}$ into (13). Overall, once we know $x[q]$ up to phase, then we can find $x[k]$ relative to this same phase for each $k = 0, \ldots, q - 1$, provided we know $\operatorname{Re}(x[q]\overline{x[k]})$ and $\operatorname{Re}(\omega_q \overline{\omega_k}\, x[q]\overline{x[k]})$ for these $k$'s. Thankfully, these values can be determined from the entries of $\operatorname{CirAut}(x + Rx)$ and $\operatorname{CirAut}(Ex + REx)$:

Theorem 5. Let $x$ denote an $M$-dimensional complex signal embedded in $\ell(\mathbb{Z}_{4M-3})$ such that $x[p] = 0$ for all $p = M, \ldots, 4M - 4$ and $E$ be a $(4M - 3) \times (4M - 3)$ diagonal modulation operator with diagonal entries $\{\omega_k\}_{k=0}^{4M-4}$ satisfying $|\omega_k| = 1$ for all $k = 0, \ldots, 4M - 4$ and $\omega_j \overline{\omega_k} \notin \mathbb{R}$ for all $j \neq k$. Then $x$ can be recovered up to a global phase factor from $\operatorname{CirAut}(x + Rx)$ and $\operatorname{CirAut}(Ex + REx)$.

Proof. Letting $q$ denote the index of the last nonzero entry of $x$, it suffices to estimate $\{x[k]\}_{k=0}^{q}$ up to a global phase factor. To this end, recall from Lemma 3 that the last nonzero entry of $\{\operatorname{CirAut}(x + Rx)[p]\}_{p=0}^{2M-2}$ has index $p = 2q$. If $q = 0$, then we have already seen that $\operatorname{CirAut}(x + Rx)[0] = 4|x[0]|^2$. Since there exists $\psi \in [0, 2\pi)$ such that $x[0] = e^{-i\psi}|x[0]|$, we may take $\hat{x}[0] := \frac{1}{2}\sqrt{\operatorname{CirAut}(x + Rx)[0]} = |x[0]| = e^{i\psi} x[0]$. Otherwise $q \in [1, M - 1]$, and (4) gives

$$\operatorname{CirAut}(x + Rx)[2q] = 2\operatorname{Re}\Big(\sum_{p'=q+1}^{M-1} x[p']\big(\overline{x[p' - 2q]} + \overline{x[2q - p']}\big)\Big) + |x[q]|^2 = |x[q]|^2.$$

Thus, taking $\hat{x}[q] := \sqrt{\operatorname{CirAut}(x + Rx)[2q]} = |x[q]|$ gives us $\hat{x}[q] = e^{i\psi} x[q]$ for some $\psi \in [0, 2\pi)$.

In the case where $q = 1$, all that remains to determine is $\hat{x}[0]$, a calculation which we save for the end of the proof. For now, suppose $q \geq 2$. Since we already know $\hat{x}[q] = e^{i\psi} x[q]$, we would like to determine $\hat{x}[k]$ for $k = 1, \ldots, q - 1$. To this end, take $r \in [0, q - 2]$ and suppose we have $\hat{x}[k] = e^{i\psi} x[k]$ for all $k = q - r, \ldots, q$. If we can obtain $\hat{x}[q - (r + 1)]$ up to the same phase from this information, then working iteratively from $r = 0$ to $r = q - 2$ will give us $\hat{x}[k]$ up to global phase for all but the zeroth entry (which we address later). Note when $r$ is even, (4) gives

$$\begin{aligned}
\operatorname{CirAut}(x + Rx)[2q - (r + 1)] &= 2\operatorname{Re}\Big(\sum_{p'=q-\frac{r}{2}}^{M-1} x[p']\big(\overline{x[p' - (2q - (r + 1))]} + \overline{x[(2q - (r + 1)) - p']}\big)\Big) \\
&= 2\operatorname{Re}\big(x[q]\,\overline{x[q - (r + 1)]}\big) + 2\sum_{p'=q-\frac{r}{2}}^{q-1} \operatorname{Re}\big(x[p']\,\overline{x[(2q - (r + 1)) - p']}\big),
\end{aligned}$$

where the last equality follows from the observation that $p' - (2q - (r + 1)) \leq -q + (r + 1) \leq -1$ over the range of the sum, meaning $x[p' - (2q - (r + 1))] = 0$ throughout the sum. Similarly when $r$ is odd, (4) gives

$$\operatorname{CirAut}(x + Rx)[2q - (r + 1)] = 2\operatorname{Re}\big(x[q]\,\overline{x[q - (r + 1)]}\big) + 2\sum_{p'=q-\frac{r-1}{2}}^{q-1} \operatorname{Re}\big(x[p']\,\overline{x[(2q - (r + 1)) - p']}\big) + \big|x[q - \tfrac{r+1}{2}]\big|^2.$$

In either case, we can isolate $\operatorname{Re}(x[q]\overline{x[q - (r + 1)]})$ to get an expression in terms of $\operatorname{CirAut}(x + Rx)[2q - (r + 1)]$ and other terms of the form $\operatorname{Re}(x[k]\overline{x[k']})$ or $|x[k]|^2$ for $k, k' \in [q - r, q - 1]$. By the induction hypothesis, we have $\hat{x}[k] = e^{i\psi} x[k]$ for $k = q - r, \ldots, q - 1$, and so we can use these estimates to determine these other terms:

$$\operatorname{Re}\big(\hat{x}[k]\overline{\hat{x}[k']}\big) = \operatorname{Re}\big(e^{i\psi} x[k]\,\overline{e^{i\psi} x[k']}\big) = \operatorname{Re}\big(x[k]\overline{x[k']}\big), \qquad |\hat{x}[k]|^2 = |e^{i\psi} x[k]|^2 = |x[k]|^2.$$

As such, we can use $\operatorname{CirAut}(x + Rx)[2q - (r + 1)]$ along with the higher-indexed estimates $\hat{x}[k]$ to determine $\operatorname{Re}(x[q]\overline{x[q - (r + 1)]})$. Similarly, we can use $\operatorname{CirAut}(Ex + REx)[2q - (r + 1)]$ along with the higher-indexed estimates $\hat{x}[k]$ to determine $\operatorname{Re}(\omega_q \overline{\omega_{q-(r+1)}}\, x[q]\overline{x[q - (r + 1)]})$. We then plug these into (14), along with the estimate $\hat{x}[q] = e^{i\psi} x[q]$ (which is also available by the induction hypothesis), to get $\hat{x}[q - (r + 1)] = e^{i\psi} x[q - (r + 1)]$.

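At this point, we have determined $\{x[k]\}_{k=1}^{q}$ up to a global phase factor whenever $q \geq 1$, and so it remains to find $\hat{x}[0]$. For this, note that when $q$ is odd, (4) gives

$$\operatorname{CirAut}(x+Rx)[q] = 4\operatorname{Re}\big(x[q]\overline{x[0]}\big) + 2\sum_{p'=\frac{q+1}{2}}^{q-1}\operatorname{Re}\big(x[p']\overline{x[q-p']}\big),$$

while for even $q$, we have

$$\operatorname{CirAut}(x+Rx)[q] = 4\operatorname{Re}\big(x[q]\overline{x[0]}\big) + 2\sum_{p'=\frac{q}{2}+1}^{q-1}\operatorname{Re}\big(x[p']\overline{x[q-p']}\big) + \big|x[\tfrac{q}{2}]\big|^2.$$

As before, isolating $\operatorname{Re}(x[q]\overline{x[0]})$ in either case produces an expression in terms of $\operatorname{CirAut}(x+Rx)[q]$ and other terms of the form $\operatorname{Re}(x[k]\overline{x[k']})$ or $|x[k]|^2$ for $k, k' \in [1, q-1]$. These other terms can be calculated using the estimates $\{\hat{x}[k]\}_{k=1}^{q-1}$, and so we can also calculate $\operatorname{Re}(x[q]\overline{x[0]})$ from $\operatorname{CirAut}(x+Rx)[q]$. Similarly, we can calculate $\operatorname{Re}(\omega_q\overline{\omega_0}\,x[q]\overline{x[0]})$ from $\{\hat{x}[k]\}_{k=1}^{q-1}$ and $\operatorname{CirAut}(Ex+REx)[q]$, and plugging these into (14) along with $\hat{x}[q]$ produces the estimate $\hat{x}[0] = e^{i\psi}x[0]$. □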
Theorem 5 establishes that it is possible to recover a signal $x \in \mathbb{C}^M$ up to a global phase factor from $\{\operatorname{CirAut}(x+Rx)[q]\}_{q=0}^{2M-2}$ and $\{\operatorname{CirAut}(Ex+REx)[q]\}_{q=0}^{2M-2}$. We now return to how these circular autocorrelations relate to intensity measurements. Recall from (2) that the DFT of the circular autocorrelation is the modulus squared of the DFT of the original signal: $(F^*\operatorname{CirAut}(u))[q] = |(F^*u)[q]|^2$. Also note that the DFT commutes with the reversal operator:

$$(F^*Ru)[q] = \sum_{p\in\mathbb{Z}_P} u[-p]\,e^{-2\pi i pq/P} = \sum_{p'\in\mathbb{Z}_P} u[p']\,e^{-2\pi i p'(-q)/P} = (F^*u)[-q] = (RF^*u)[q].$$

With this, we can express $\operatorname{CirAut}(x+Rx)$ in terms of intensity measurements with a particular ensemble:

$$\begin{aligned}
(F^*\operatorname{CirAut}(x+Rx))[q] &= |(F^*(x+Rx))[q]|^2 \\
&= |(F^*x)[q] + (F^*Rx)[q]|^2 = |(F^*x)[q] + (F^*x)[-q]|^2 \\
&= |\langle x, f_q + f_{-q}\rangle|^2.
\end{aligned}$$


Defining the $q$th discrete cosine function $c_q \in \ell(\mathbb{Z}_{4M-3})$ by

$$c_q[p] := 2\cos\Big(\frac{2\pi pq}{4M-3}\Big) = e^{2\pi i pq/(4M-3)} + e^{-2\pi i pq/(4M-3)} = (f_q + f_{-q})[p],$$


this means that $(F^*\operatorname{CirAut}(x+Rx))[q] = |\langle x, c_q\rangle|^2$ for all $q \in \mathbb{Z}_{4M-3}$. Similarly, if we take the modulation matrix $E$ to have diagonal entries $\omega_k := e^{2\pi i k/(2M-1)}$ for all $k = 0, \ldots, 4M-4$, we find

$$(F^*\operatorname{CirAut}(Ex+REx))[q] = |\langle Ex, c_q\rangle|^2 = |\langle x, E^*c_q\rangle|^2.$$

Thus, coupling the DFT with Theorem 5 allows us to recover the signal $x$ from $4M-2$ intensity measurements, namely with the ensemble $\{c_q\}_{q=0}^{2M-2} \cup \{E^*c_q\}_{q=0}^{2M-2}$. Note that since $x \in \ell(\mathbb{Z}_{4M-3})$ is actually a zero-padded version of $x \in \mathbb{C}^M$, we may view $c_q$ and $E^*c_q$ as members of $\mathbb{C}^M$ by discarding the entries indexed by $p = M, \ldots, 4M-4$.
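The two identities above can be checked numerically. The following is a minimal sketch, assuming NumPy's `fft` convention matches $F^*$ (that is, $(F^*u)[q] = \sum_p u[p]e^{-2\pi i pq/P}$) and that $\operatorname{CirAut}(u)[j] = \sum_k u[k]\overline{u[k-j]}$; the helper names `reversal`, `cir_aut`, and `intensity_measurements` are hypothetical and not part of the paper.

```python
import numpy as np

def reversal(u):
    """(Ru)[p] = u[-p], with indices taken modulo len(u)."""
    return np.roll(u[::-1], 1)

def cir_aut(u):
    """CirAut(u)[j] = sum_k u[k] * conj(u[k - j]), indices modulo len(u)."""
    P = len(u)
    k = np.arange(P)
    return np.array([np.sum(u * np.conj(u[(k - j) % P])) for j in range(P)])

def intensity_measurements(x):
    """Return |<x, c_q>|^2 and |<x, E* c_q>|^2 for every q in Z_{4M-3}."""
    M = len(x)
    P = 4 * M - 3
    p = np.arange(M)
    omega = np.exp(2j * np.pi * p / (2 * M - 1))   # first M diagonal entries of E
    # Row q of C is the truncated cosine c_q[p] = 2 cos(2 pi p q / (4M - 3)).
    C = 2 * np.cos(2 * np.pi * np.outer(np.arange(P), p) / P)
    return np.abs(C @ x) ** 2, np.abs(C @ (omega * x)) ** 2

M = 5
P = 4 * M - 3
rng = np.random.default_rng(0)
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)

x_pad = np.concatenate([x, np.zeros(P - M)])               # zero-padded signal
Ex = np.exp(2j * np.pi * np.arange(P) / (2 * M - 1)) * x_pad

meas_c, meas_Ec = intensity_measurements(x)
dft_aut_x = np.fft.fft(cir_aut(x_pad + reversal(x_pad)))
dft_aut_Ex = np.fft.fft(cir_aut(Ex + reversal(Ex)))

print(np.allclose(dft_aut_x, meas_c), np.allclose(dft_aut_Ex, meas_Ec))
```

Only the first $M$ entries of each $c_q$ enter the inner products, which is the truncation described above.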

Considering this section promised phase retrieval from only $4M-4$ intensity measurements, we must somehow find a way to discard two of these $4M-2$ measurement vectors. To do this, first note that


$$\begin{aligned}
\operatorname{CirAut}(Ex+REx)[0] &= \|Ex+REx\|^2 \\
&= \sum_{k\in\mathbb{Z}_{4M-3}} \big|e^{2\pi i k/(2M-1)}x[k] + e^{2\pi i(-k)/(2M-1)}x[-k]\big|^2 \\
&= \sum_{k=-(2M-2)}^{-1} \big|e^{2\pi i(-k)/(2M-1)}x[-k]\big|^2 + |2x[0]|^2 + \sum_{k=1}^{2M-2} \big|e^{2\pi i k/(2M-1)}x[k]\big|^2 \\
&= \|x+Rx\|^2 \\
&= \operatorname{CirAut}(x+Rx)[0].
\end{aligned}$$


Moreover, we have

$$\begin{aligned}
\operatorname{CirAut}(Ex+REx)[2M-2] &= \sum_{k\in\mathbb{Z}_{4M-3}} (Ex+REx)[k]\,\overline{(Ex+REx)[k-(2M-2)]} \\
&= (Ex+REx)[M-1]\,\overline{(Ex+REx)[-(M-1)]} \\
&= (Ex+REx)[M-1]\,\overline{(Ex+REx)[M-1]},
\end{aligned}$$

where the last equality is by even symmetry. Since $x$ is only supported on $k = 0, \ldots, M-1$, we then have

$$\begin{aligned}
\operatorname{CirAut}(Ex+REx)[2M-2] &= |(Ex+REx)[M-1]|^2 \\
&= \big|e^{2\pi i(M-1)/(2M-1)}x[M-1] + e^{-2\pi i(M-1)/(2M-1)}x[-(M-1)]\big|^2 \\
&= \big|e^{2\pi i(M-1)/(2M-1)}x[M-1]\big|^2 = |x[M-1]|^2 \\
&= \operatorname{CirAut}(x+Rx)[2M-2].
\end{aligned}$$

Furthermore, the even symmetry of the circular autocorrelation also gives

$$\begin{aligned}
\operatorname{CirAut}(Ex+REx)[-(2M-2)] &= \operatorname{CirAut}(Ex+REx)[2M-2] \\
&= \operatorname{CirAut}(x+Rx)[2M-2] \\
&= \operatorname{CirAut}(x+Rx)[-(2M-2)].
\end{aligned}$$
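As a sanity check on these three coincidences, the sketch below compares the $0$, $2M-2$, and $-(2M-2)$ entries of the two autocorrelations for a random signal. It assumes the same conventions as the derivation ($\operatorname{CirAut}(u)[j] = \sum_k u[k]\overline{u[k-j]}$, indices modulo $4M-3$); the helper names are hypothetical.

```python
import numpy as np

def reversal(u):
    """(Ru)[p] = u[-p], with indices taken modulo len(u)."""
    return np.roll(u[::-1], 1)

def cir_aut(u):
    """CirAut(u)[j] = sum_k u[k] * conj(u[k - j]), indices modulo len(u)."""
    P = len(u)
    k = np.arange(P)
    return np.array([np.sum(u * np.conj(u[(k - j) % P])) for j in range(P)])

M = 4
P = 4 * M - 3
rng = np.random.default_rng(1)
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
x_pad = np.concatenate([x, np.zeros(P - M)])

# Modulation with diagonal entries omega_k = exp(2 pi i k / (2M - 1)).
Ex = np.exp(2j * np.pi * np.arange(P) / (2 * M - 1)) * x_pad

aut_x = cir_aut(x_pad + reversal(x_pad))
aut_Ex = cir_aut(Ex + reversal(Ex))

shared = [0, 2 * M - 2, -(2 * M - 2)]
for j in shared:
    print(j, np.isclose(aut_x[j % P], aut_Ex[j % P]))

# The entry at 2M - 2 should also equal |x[M-1]|^2.
print(np.isclose(aut_x[2 * M - 2], abs(x[M - 1]) ** 2))
```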

These redundancies between $\operatorname{CirAut}(x+Rx)$ and $\operatorname{CirAut}(Ex+REx)$ indicate that we might be able to remove measurement vectors from our ensemble while maintaining our ability to perform phase retrieval. The following theorem confirms this suspicion:

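Theorem 6. Let $c_q \in \mathbb{C}^M$ be the truncated discrete cosine function defined by $c_q[p] := 2\cos\big(\frac{2\pi pq}{4M-3}\big)$ for all $p = 0, \ldots, M-1$, and let $E$ be the $M \times M$ diagonal modulation operator with diagonal entries $\omega_k := e^{2\pi i k/(2M-1)}$ for all $k = 0, \ldots, M-1$. Then the intensity measurement mapping $\mathcal{A}\colon \mathbb{C}^M/\mathbb{T} \to \mathbb{R}^{4M-4}$ defined by $\mathcal{A}(x) := \{|\langle x, c_q\rangle|^2\}_{q=0}^{2M-2} \cup \{|\langle x, E^*c_q\rangle|^2\}_{q=1}^{2M-3}$ is injective.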

Proof. Since Theorem 5 allows us to reconstruct any $x \in \mathbb{C}^M$ up to a global phase factor from the entries of $\operatorname{CirAut}(x+Rx)$ and $\operatorname{CirAut}(Ex+REx)$, it suffices to show that the intensity measurements $\{|\langle x, c_q\rangle|^2\}_{q=0}^{2M-2} \cup \{|\langle x, E^*c_q\rangle|^2\}_{q=1}^{2M-3}$ allow us to recover the entries of these circular autocorrelations. To this end, recall from (2) that these quantities are related through the inverse DFT:

$$\operatorname{CirAut}(x+Rx) = (F^*)^{-1}\{|\langle x, c_q\rangle|^2\}_{q\in\mathbb{Z}_{4M-3}},$$

$$\operatorname{CirAut}(Ex+REx) = (F^*)^{-1}\{|\langle x, E^*c_q\rangle|^2\}_{q\in\mathbb{Z}_{4M-3}}.$$

Since we have $\{|\langle x, c_q\rangle|^2\}_{q=0}^{2M-2}$, we can exploit even symmetry to determine the rest of $\{|\langle x, c_q\rangle|^2\}_{q\in\mathbb{Z}_{4M-3}}$, and then apply the inverse DFT to get $\operatorname{CirAut}(x+Rx)$. Moreover, by the previous discussion, we also obtain the $0$, $2M-2$, and $-(2M-2)$ entries of $\operatorname{CirAut}(Ex+REx)$ from the corresponding entries of $\operatorname{CirAut}(x+Rx)$. Organize this information about $\operatorname{CirAut}(Ex+REx)$ into a vector $w \in \ell(\mathbb{Z}_{4M-3})$ whose $0$, $2M-2$, and $-(2M-2)$ entries come from $\operatorname{CirAut}(Ex+REx)$ and whose remaining entries are populated by even symmetry from $\{|\langle x, E^*c_q\rangle|^2\}_{q=1}^{2M-3}$. We can express $w$ as a matrix-vector product $w = A\{|\langle x, E^*c_q\rangle|^2\}_{q\in\mathbb{Z}_{4M-3}}$, where $A$ is the identity matrix with the $0$, $2M-2$, and $-(2M-2)$ rows replaced by the corresponding rows of the inverse DFT matrix. To complete the proof, it suffices to show that the matrix $A$ is invertible, since this would imply $\operatorname{CirAut}(Ex+REx) = (F^*)^{-1}A^{-1}w$.

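Using the cofactor expansion, note that $\det(A)$ reduces to a determinant of a $3 \times 3$ submatrix of $(F^*)^{-1}$. Specifically, letting $\theta := 2\pi(2M-2)^2/(4M-3)$ we have

$$\begin{aligned}
\det(A) &= \det\begin{pmatrix} 1 & 1 & 1 \\ 1 & e^{i\theta} & e^{-i\theta} \\ 1 & e^{-i\theta} & e^{i\theta} \end{pmatrix} \\
&= (e^{2i\theta} - e^{-2i\theta}) - (e^{i\theta} - e^{-i\theta}) + (e^{-i\theta} - e^{i\theta}) \\
&= (e^{i\theta} + e^{-i\theta} - 2)(e^{i\theta} - e^{-i\theta}) \\
&= 4i(\cos\theta - 1)\sin\theta,
\end{aligned}$$

and so $A$ is invertible if and only if $\cos\theta - 1 \neq 0$ and $\sin\theta \neq 0$. This is equivalent to having $\pi$ not divide $\theta$, and indeed, the ratio

$$\frac{\theta}{\pi} = \frac{2(2M-2)^2}{4M-3} = 2M - \frac{5}{2} + \frac{1}{2(4M-3)}$$

is not an integer because $M \geq 2$. As such, $A$ is invertible. □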
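The determinant formula can be spot-checked numerically for small $M$. This is a sketch under one stated assumption: the inverse DFT matrix is taken with NumPy's normalization, so its entries carry a factor $1/(4M-3)$ that the displayed $3 \times 3$ matrix omits, and the numerical determinant is therefore rescaled by $(4M-3)^3$ before comparison. The function name `build_A` is hypothetical.

```python
import numpy as np

def build_A(M):
    """Identity matrix with rows 0, 2M-2, -(2M-2) replaced by inverse-DFT rows."""
    P = 4 * M - 3
    F_inv = np.fft.ifft(np.eye(P), axis=0)   # entries exp(2 pi i j q / P) / P
    rows = [0, 2 * M - 2, (-(2 * M - 2)) % P]
    A = np.eye(P, dtype=complex)
    A[rows, :] = F_inv[rows, :]
    return A

for M in range(2, 8):
    P = 4 * M - 3
    theta = 2 * np.pi * (2 * M - 2) ** 2 / P
    closed_form = 4j * (np.cos(theta) - 1) * np.sin(theta)
    scaled_det = np.linalg.det(build_A(M)) * P ** 3
    print(M, np.isclose(scaled_det, closed_form), abs(scaled_det) > 1e-6)
```

For these values of $M$ the determinant stays well away from zero, consistent with $\theta/\pi$ lying close to a half-integer.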


We conclude this section by summarizing our measurement design and phase retrieval procedure:

Measurement design

• Define the $q$th truncated discrete cosine function $c_q := \{2\cos(\frac{2\pi pq}{4M-3})\}_{p=0}^{M-1}$

• Define the $M \times M$ diagonal matrix $E$ with entries $\omega_k := e^{2\pi i k/(2M-1)}$ for all $k = 0, \ldots, M-1$

• Take $\Phi := \{c_q\}_{q=0}^{2M-2} \cup \{E^*c_q\}_{q=1}^{2M-3}$

Phase retrieval procedure

• Calculate $\{|\langle x, c_q\rangle|^2\}_{q\in\mathbb{Z}_{4M-3}}$ from $\{|\langle x, c_q\rangle|^2\}_{q=0}^{2M-2}$ by even extension

• Calculate $\operatorname{CirAut}(x+Rx) = (F^*)^{-1}\{|\langle x, c_q\rangle|^2\}_{q\in\mathbb{Z}_{4M-3}}$

• Define $w \in \ell(\mathbb{Z}_{4M-3})$ so that its $0$, $2M-2$, and $-(2M-2)$ entries are the corresponding entries in $\operatorname{CirAut}(x+Rx)$ and its remaining entries are populated by even symmetry from $\{|\langle x, E^*c_q\rangle|^2\}_{q=1}^{2M-3}$

• Define $A$ to be the identity matrix with the $0$, $2M-2$, and $-(2M-2)$ rows replaced by the corresponding rows of the inverse DFT matrix $(F^*)^{-1}$

• Calculate $\operatorname{CirAut}(Ex+REx) = (F^*)^{-1}A^{-1}w$

• Recover $x$ up to global phase from $\operatorname{CirAut}(x+Rx)$ and $\operatorname{CirAut}(Ex+REx)$ using the process described in the proof of Theorem 5
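The first five steps of the procedure can be sketched as follows. This is an illustrative implementation only, assuming NumPy's `fft`/`ifft` play the roles of $F^*$ and $(F^*)^{-1}$; the function names are hypothetical. The final step (recovering $x$ itself via the proof of Theorem 5) relies on equation (14), which is not reproduced in this section, so it is deliberately left out.

```python
import numpy as np

def reversal(u):
    """(Ru)[p] = u[-p], with indices taken modulo len(u)."""
    return np.roll(u[::-1], 1)

def cir_aut(u):
    """CirAut(u)[j] = sum_k u[k] * conj(u[k - j]), indices modulo len(u)."""
    P = len(u)
    k = np.arange(P)
    return np.array([np.sum(u * np.conj(u[(k - j) % P])) for j in range(P)])

def autocorrelations_from_measurements(meas_c, meas_Ec, M):
    """meas_c[q] = |<x,c_q>|^2 for q = 0..2M-2;
    meas_Ec[q-1] = |<x,E*c_q>|^2 for q = 1..2M-3."""
    P = 4 * M - 3
    # Step 1: even extension of the first measurement set.
    full_c = np.zeros(P)
    for q in range(2 * M - 1):
        full_c[q] = full_c[(-q) % P] = meas_c[q]
    # Step 2: inverse DFT gives CirAut(x + Rx).
    aut_x = np.fft.ifft(full_c)
    # Step 3: build w from the second set plus three borrowed entries.
    special = [0, 2 * M - 2, (-(2 * M - 2)) % P]
    w = np.zeros(P, dtype=complex)
    for q in range(1, 2 * M - 2):
        w[q] = w[(-q) % P] = meas_Ec[q - 1]
    w[special] = aut_x[special]
    # Step 4: identity with three rows replaced by inverse-DFT rows.
    A = np.eye(P, dtype=complex)
    A[special, :] = np.fft.ifft(np.eye(P), axis=0)[special, :]
    # Step 5: CirAut(Ex + REx) = (F*)^{-1} A^{-1} w.
    aut_Ex = np.fft.ifft(np.linalg.solve(A, w))
    return aut_x, aut_Ex

M = 4
P = 4 * M - 3
rng = np.random.default_rng(2)
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
p = np.arange(M)
omega = np.exp(2j * np.pi * p / (2 * M - 1))

def c(q):
    return 2 * np.cos(2 * np.pi * p * q / P)

# The 4M - 4 intensity measurements of Theorem 6.
meas_c = np.array([abs(c(q) @ x) ** 2 for q in range(2 * M - 1)])
meas_Ec = np.array([abs(c(q) @ (omega * x)) ** 2 for q in range(1, 2 * M - 2)])

aut_x, aut_Ex = autocorrelations_from_measurements(meas_c, meas_Ec, M)

# Compare against autocorrelations computed directly from the signal.
x_pad = np.concatenate([x, np.zeros(P - M)])
Ex = np.exp(2j * np.pi * np.arange(P) / (2 * M - 1)) * x_pad
print(np.allclose(aut_x, cir_aut(x_pad + reversal(x_pad))))
print(np.allclose(aut_Ex, cir_aut(Ex + reversal(Ex))))
```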

3.  Almost  injectivity 

While  4 M  +  o(M)  measurements  are  necessary  and  generically  sufficient  for  injectivity 
in  the  complex  case,  one  can  save  a  factor  of  2  in  the  number  of  measurements  by 
slightly  weakening  the  desired  notion  of  injectivity  [4,28].  To  be  explicit,  we  start  with 
the  following  definition: 

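Definition 7. Consider $\Phi = \{\varphi_n\}_{n=1}^{N} \subseteq \mathbb{R}^M$. The intensity measurement mapping $\mathcal{A}\colon \mathbb{R}^M/\{\pm 1\} \to \mathbb{R}^N$ defined by $(\mathcal{A}(x))(n) := |\langle x, \varphi_n\rangle|^2$ is said to be almost injective if $\mathcal{A}^{-1}(\mathcal{A}(x)) = \{\pm x\}$ for almost every $x \in \mathbb{R}^M$.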

The above definition specifically treats the real case, but it can be similarly defined for the complex case in the obvious way. For the complex case, it is known that $2M$ measurements are necessary for almost injectivity [28], and that $2M$ generic measurements suffice [4] (cf. [27]); this is the factor-of-2 savings mentioned above. For the real case, it is also known how many measurements are necessary and generically sufficient for almost injectivity: $M+1$ [4]. Like the complex case, this is also a factor-of-2 savings from the injectivity requirement: $2M-1$. This requirement for injectivity in the real case follows from the following result from [4], which we prove here because the proof is short and inspires the remainder of this section:

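Theorem 8. Consider $\Phi = \{\varphi_n\}_{n=1}^{N} \subseteq \mathbb{R}^M$ and the intensity measurement mapping $\mathcal{A}\colon \mathbb{R}^M/\{\pm 1\} \to \mathbb{R}^N$ defined by $(\mathcal{A}(x))(n) := |\langle x, \varphi_n\rangle|^2$. Then $\mathcal{A}$ is injective if and only if for every $S \subseteq \{1, \ldots, N\}$, either $\{\varphi_n\}_{n\in S}$ or $\{\varphi_n\}_{n\in S^c}$ spans $\mathbb{R}^M$.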

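Proof. We will prove both directions by obtaining the contrapositives.

(⇒) Assume there exists $S \subseteq \{1, \ldots, N\}$ such that neither $\{\varphi_n\}_{n\in S}$ nor $\{\varphi_n\}_{n\in S^c}$ spans $\mathbb{R}^M$. This implies that there are nonzero vectors $u, v \in \mathbb{R}^M$ such that $\langle u, \varphi_n\rangle = 0$ for all $n \in S$ and $\langle v, \varphi_n\rangle = 0$ for all $n \in S^c$. For each $n$, we then have

$$|\langle u \pm v, \varphi_n\rangle|^2 = |\langle u, \varphi_n\rangle|^2 \pm 2\langle u, \varphi_n\rangle\langle v, \varphi_n\rangle + |\langle v, \varphi_n\rangle|^2 = |\langle u, \varphi_n\rangle|^2 + |\langle v, \varphi_n\rangle|^2.$$

Since $|\langle u+v, \varphi_n\rangle|^2 = |\langle u-v, \varphi_n\rangle|^2$ for every $n$, we have $\mathcal{A}(u+v) = \mathcal{A}(u-v)$. Moreover, $u$ and $v$ are nonzero by assumption, and so $u+v \neq \pm(u-v)$.

(⇐) Assume that $\mathcal{A}$ is not injective. Then there exist vectors $x, y \in \mathbb{R}^M$ such that $x \neq \pm y$ and $\mathcal{A}(x) = \mathcal{A}(y)$. Taking $S := \{n : \langle x, \varphi_n\rangle = -\langle y, \varphi_n\rangle\}$, we have $\langle x+y, \varphi_n\rangle = 0$ for every $n \in S$. Otherwise when $n \in S^c$, we have $\langle x, \varphi_n\rangle = \langle y, \varphi_n\rangle$ and so $\langle x-y, \varphi_n\rangle = 0$. Furthermore, both $x+y$ and $x-y$ are nontrivial since $x \neq \pm y$, and so neither $\{\varphi_n\}_{n\in S}$ nor $\{\varphi_n\}_{n\in S^c}$ spans $\mathbb{R}^M$. □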
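The spanning condition of Theorem 8 can be tested directly, at exponential cost, by enumerating subsets. The sketch below is a brute-force illustration for small ensembles only; the function names `spans` and `is_injective` are hypothetical, and rank is computed numerically with NumPy's default tolerance.

```python
from itertools import combinations
import numpy as np

def spans(Phi, cols):
    """True if the columns of Phi indexed by cols span R^M."""
    M = Phi.shape[0]
    if len(cols) == 0:
        return False
    return np.linalg.matrix_rank(Phi[:, list(cols)]) == M

def is_injective(Phi):
    """Theorem 8 test: for every S, Phi_S or Phi_{S^c} must span R^M.

    Phi is an M x N real matrix whose columns are the measurement vectors.
    The loop visits all 2^N subsets, so this is only practical for small N.
    """
    N = Phi.shape[1]
    for size in range(N + 1):
        for S in combinations(range(N), size):
            Sc = [n for n in range(N) if n not in S]
            if not spans(Phi, S) and not spans(Phi, Sc):
                return False
    return True

# Three vectors in R^2, any two of which are independent (N = 2M - 1).
Phi_good = np.array([[1.0, 0.0, 1.0],
                     [0.0, 1.0, 1.0]])
# An orthonormal basis of R^2: S = {first vector} violates the condition.
Phi_bad = np.eye(2)

print(is_injective(Phi_good), is_injective(Phi_bad))
```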

Similar  to  the  above  result,  in  this  section,  we  characterize  ensembles  of  measurement 
vectors  which  yield  almost  injective  intensity  measurements,  and  similar  to  the  above 
proof,  the  basic  idea  behind  our  analysis  is  to  consider  sums  and  differences  of  signals 
with  identical  intensity  measurements.  Our  characterization  starts  with  the  following 
lemma: 

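Lemma 9. Consider $\Phi = \{\varphi_n\}_{n=1}^{N} \subseteq \mathbb{R}^M$ and the intensity measurement mapping $\mathcal{A}\colon \mathbb{R}^M/\{\pm 1\} \to \mathbb{R}^N$ defined by $(\mathcal{A}(x))(n) := |\langle x, \varphi_n\rangle|^2$. Then $\mathcal{A}$ is almost injective if and only if almost every $x \in \mathbb{R}^M$ is not in the Minkowski sum $\operatorname{span}(\Phi_S)^\perp \setminus \{0\} + \operatorname{span}(\Phi_{S^c})^\perp \setminus \{0\}$ for all $S \subseteq \{1, \ldots, N\}$. More precisely, $\mathcal{A}^{-1}(\mathcal{A}(x)) = \{\pm x\}$ if and only if $x \notin \operatorname{span}(\Phi_S)^\perp \setminus \{0\} + \operatorname{span}(\Phi_{S^c})^\perp \setminus \{0\}$ for any $S \subseteq \{1, \ldots, N\}$.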

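Proof. By the definition of the mapping $\mathcal{A}$, for $x, y \in \mathbb{R}^M$ we have $\mathcal{A}(x) = \mathcal{A}(y)$ if and only if $|\langle x, \varphi_n\rangle| = |\langle y, \varphi_n\rangle|$ for all $n \in \{1, \ldots, N\}$. This occurs precisely when there is a subset $S \subseteq \{1, \ldots, N\}$ such that $\langle x, \varphi_n\rangle = -\langle y, \varphi_n\rangle$ for every $n \in S$ and $\langle x, \varphi_n\rangle = \langle y, \varphi_n\rangle$ for every $n \in S^c$. Thus, $\mathcal{A}^{-1}(\mathcal{A}(x)) = \{\pm x\}$ if and only if for every $y \neq \pm x$ and for every $S \subseteq \{1, \ldots, N\}$, either there exists an $n \in S$ such that $\langle x+y, \varphi_n\rangle \neq 0$ or an $n \in S^c$ such that $\langle x-y, \varphi_n\rangle \neq 0$. We claim that this occurs if and only if $x$ is not in the Minkowski sum $\operatorname{span}(\Phi_S)^\perp \setminus \{0\} + \operatorname{span}(\Phi_{S^c})^\perp \setminus \{0\}$ for all $S \subseteq \{1, \ldots, N\}$, which would complete the proof. We verify the claim by seeking the contrapositive in each direction.

(⇒) Suppose $x \in \operatorname{span}(\Phi_S)^\perp \setminus \{0\} + \operatorname{span}(\Phi_{S^c})^\perp \setminus \{0\}$. Then there exist $u \in \operatorname{span}(\Phi_S)^\perp \setminus \{0\}$ and $v \in \operatorname{span}(\Phi_{S^c})^\perp \setminus \{0\}$ such that $x = u+v$. Taking $y := u-v$, we see that $x+y = 2u \in \operatorname{span}(\Phi_S)^\perp \setminus \{0\}$ and $x-y = 2v \in \operatorname{span}(\Phi_{S^c})^\perp \setminus \{0\}$, which means that there is no $n \in S$ such that $\langle x+y, \varphi_n\rangle \neq 0$ nor $n \in S^c$ such that $\langle x-y, \varphi_n\rangle \neq 0$. Furthermore, $u$ and $v$ are nonzero, and so $y \neq \pm x$.

(⇐) Suppose $y \neq \pm x$ and for some $S \subseteq \{1, \ldots, N\}$ there is no $n \in S$ such that $\langle x+y, \varphi_n\rangle \neq 0$ nor $n \in S^c$ such that $\langle x-y, \varphi_n\rangle \neq 0$. Then $x+y \in \operatorname{span}(\Phi_S)^\perp \setminus \{0\}$ and $x-y \in \operatorname{span}(\Phi_{S^c})^\perp \setminus \{0\}$. Since $x = \frac{1}{2}(x+y) + \frac{1}{2}(x-y)$, we have that $x \in \operatorname{span}(\Phi_S)^\perp \setminus \{0\} + \operatorname{span}(\Phi_{S^c})^\perp \setminus \{0\}$. □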

Theorem 10. Consider $\Phi = \{\varphi_n\}_{n=1}^{N} \subseteq \mathbb{R}^M$ and the intensity measurement mapping $\mathcal{A}\colon \mathbb{R}^M/\{\pm 1\} \to \mathbb{R}^N$ defined by $(\mathcal{A}(x))(n) := |\langle x, \varphi_n\rangle|^2$. Suppose $\Phi$ spans $\mathbb{R}^M$ and each $\varphi_n$ is nonzero. Then $\mathcal{A}$ is almost injective if and only if the Minkowski sum $\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp$ is a proper subspace of $\mathbb{R}^M$ for each nonempty proper subset $S \subseteq \{1, \ldots, N\}$.

Note that the above result is not terribly surprising considering Lemma 9, as the new condition involves a simpler Minkowski sum in exchange for additional (reasonable and testable) assumptions on $\Phi$. The proof of this theorem amounts to measuring the difference between the two Minkowski sums:

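Proof of Theorem 10. First note that the spanning assumption on $\Phi$ implies

$$\operatorname{span}(\Phi_S)^\perp \cap \operatorname{span}(\Phi_{S^c})^\perp = \big(\operatorname{span}(\Phi_S) + \operatorname{span}(\Phi_{S^c})\big)^\perp = \operatorname{span}(\Phi)^\perp = \{0\},$$

and so one can prove the following identity:

$$\operatorname{span}(\Phi_S)^\perp \setminus \{0\} + \operatorname{span}(\Phi_{S^c})^\perp \setminus \{0\} = \big(\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp\big) \setminus \big(\operatorname{span}(\Phi_S)^\perp \cup \operatorname{span}(\Phi_{S^c})^\perp\big). \quad (15)$$

From Lemma 9 we know that $\mathcal{A}$ is almost injective if and only if almost every $x \in \mathbb{R}^M$ is not in the Minkowski sum $\operatorname{span}(\Phi_S)^\perp \setminus \{0\} + \operatorname{span}(\Phi_{S^c})^\perp \setminus \{0\}$ for any $S \subseteq \{1, \ldots, N\}$. In other words, the Lebesgue measure (which we denote by $\operatorname{Leb}[\cdot]$) of this Minkowski sum is zero for each $S \subseteq \{1, \ldots, N\}$. By (15), this equivalently means that the Lebesgue measure of $(\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp) \setminus (\operatorname{span}(\Phi_S)^\perp \cup \operatorname{span}(\Phi_{S^c})^\perp)$ is zero for each $S \subseteq \{1, \ldots, N\}$. Since $\Phi$ spans $\mathbb{R}^M$, this set is empty (and therefore has Lebesgue measure zero) when $S = \emptyset$ or $S = \{1, \ldots, N\}$. Also, since each $\varphi_n$ is nonzero, we know that $\operatorname{span}(\Phi_S)^\perp$ and $\operatorname{span}(\Phi_{S^c})^\perp$ are proper subspaces of $\mathbb{R}^M$ whenever $S$ is a nonempty proper subset of $\{1, \ldots, N\}$, and so in these cases both subspaces must have Lebesgue measure zero. As such, we have that for every nonempty proper subset $S \subseteq \{1, \ldots, N\}$,

$$\begin{aligned}
&\operatorname{Leb}\big[(\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp) \setminus (\operatorname{span}(\Phi_S)^\perp \cup \operatorname{span}(\Phi_{S^c})^\perp)\big] \\
&\qquad \geq \operatorname{Leb}\big[\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp\big] - \operatorname{Leb}\big[\operatorname{span}(\Phi_S)^\perp\big] - \operatorname{Leb}\big[\operatorname{span}(\Phi_{S^c})^\perp\big] \\
&\qquad = \operatorname{Leb}\big[\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp\big] \\
&\qquad \geq \operatorname{Leb}\big[(\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp) \setminus (\operatorname{span}(\Phi_S)^\perp \cup \operatorname{span}(\Phi_{S^c})^\perp)\big].
\end{aligned}$$

In summary, $(\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp) \setminus (\operatorname{span}(\Phi_S)^\perp \cup \operatorname{span}(\Phi_{S^c})^\perp)$ having Lebesgue measure zero for each $S \subseteq \{1, \ldots, N\}$ is equivalent to $\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp$ having Lebesgue measure zero for each nonempty proper subset $S \subseteq \{1, \ldots, N\}$, which in turn is equivalent to the Minkowski sum $\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp$ being a proper subspace of $\mathbb{R}^M$ for each nonempty proper subset $S \subseteq \{1, \ldots, N\}$, as desired. □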

At this point, consider the following stronger restatement of Theorem 10: "Suppose each $\varphi_n$ is nonzero. Then $\mathcal{A}$ is almost injective if and only if $\Phi$ spans $\mathbb{R}^M$ and the Minkowski sum $\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp$ is a proper subspace of $\mathbb{R}^M$ for each nonempty proper subset $S \subseteq \{1, \ldots, N\}$." Note that we can move the spanning assumption into the condition because if $\Phi$ does not span, then we can decompose almost every $x \in \mathbb{R}^M$ as $x = u+v$ such that $u \in \operatorname{span}(\Phi)$ and $v \in \operatorname{span}(\Phi)^\perp$ with $v \neq 0$, and defining $y := u-v$ then gives $\mathcal{A}(y) = \mathcal{A}(x)$ despite the fact that $y \neq \pm x$. As for the assumption that the $\varphi_n$'s are nonzero, we note that having $\varphi_n = 0$ amounts to having the $n$th entry of $\mathcal{A}(x)$ be zero for all $x$. As such, $\Phi$ yields almost injectivity precisely when the nonzero members of $\Phi$ together yield almost injectivity. With this identification, the stronger restatement of Theorem 10 above can be viewed as a complete characterization of almost injectivity. Next, we will replace the Minkowski sum condition with a rather elegant condition involving the ranks of $\Phi_S$ and $\Phi_{S^c}$:

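Theorem 11. Consider $\Phi = \{\varphi_n\}_{n=1}^{N} \subseteq \mathbb{R}^M$ and the intensity measurement mapping $\mathcal{A}\colon \mathbb{R}^M/\{\pm 1\} \to \mathbb{R}^N$ defined by $(\mathcal{A}(x))(n) := |\langle x, \varphi_n\rangle|^2$. Suppose each $\varphi_n$ is nonzero. Then $\mathcal{A}$ is almost injective if and only if $\Phi$ spans $\mathbb{R}^M$ and $\operatorname{rank}\Phi_S + \operatorname{rank}\Phi_{S^c} > M$ for each nonempty proper subset $S \subseteq \{1, \ldots, N\}$.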

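Proof. Considering the discussion after the proof of Theorem 10, it suffices to assume that $\Phi$ spans $\mathbb{R}^M$. Furthermore, considering Theorem 10, it suffices to characterize when $\dim(\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp) < M$. By the inclusion-exclusion principle for subspaces, we have

$$\dim\big(\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp\big) = \dim\big(\operatorname{span}(\Phi_S)^\perp\big) + \dim\big(\operatorname{span}(\Phi_{S^c})^\perp\big) - \dim\big(\operatorname{span}(\Phi_S)^\perp \cap \operatorname{span}(\Phi_{S^c})^\perp\big).$$

Since $\Phi$ is assumed to span $\mathbb{R}^M$, we also have that $\operatorname{span}(\Phi_S)^\perp \cap \operatorname{span}(\Phi_{S^c})^\perp = \{0\}$, and so

$$\begin{aligned}
\dim\big(\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp\big) &= \big(M - \dim(\operatorname{span}(\Phi_S))\big) + \big(M - \dim(\operatorname{span}(\Phi_{S^c}))\big) - 0 \\
&= 2M - \operatorname{rank}\Phi_S - \operatorname{rank}\Phi_{S^c}.
\end{aligned}$$

As such, $\dim(\operatorname{span}(\Phi_S)^\perp + \operatorname{span}(\Phi_{S^c})^\perp) < M$ precisely when $\operatorname{rank}\Phi_S + \operatorname{rank}\Phi_{S^c} > M$. □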
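The rank condition of Theorem 11 lends itself to a direct (exponential-time) check for small ensembles. The following is a minimal sketch; `is_almost_injective` is a hypothetical name, ranks are computed numerically with NumPy's default tolerance, and the simplex used as an example is the four-vector ensemble in $\mathbb{R}^3$ described in Fig. 1.

```python
from itertools import combinations
import numpy as np

def is_almost_injective(Phi):
    """Theorem 11 test for an M x N real matrix with nonzero columns.

    Requires that Phi spans R^M and that rank(Phi_S) + rank(Phi_{S^c}) > M
    for every nonempty proper subset S. Enumerates all subsets, so it is
    only practical for small N.
    """
    M, N = Phi.shape
    if np.any(np.linalg.norm(Phi, axis=0) == 0):
        raise ValueError("Theorem 11 assumes every measurement vector is nonzero")
    if np.linalg.matrix_rank(Phi) < M:
        return False
    for size in range(1, N):
        for S in combinations(range(N), size):
            Sc = [n for n in range(N) if n not in S]
            rank_S = np.linalg.matrix_rank(Phi[:, list(S)])
            rank_Sc = np.linalg.matrix_rank(Phi[:, Sc])
            if rank_S + rank_Sc <= M:
                return False
    return True

# Simplex in R^3: (1,1,1)/sqrt(3) and the three permutations of (1,-1,-1)/sqrt(3).
simplex = np.array([[1.0,  1.0, -1.0, -1.0],
                    [1.0, -1.0,  1.0, -1.0],
                    [1.0, -1.0, -1.0,  1.0]]) / np.sqrt(3)

print(is_almost_injective(simplex))     # full spark with N = M + 1
print(is_almost_injective(np.eye(3)))   # N = M is too few measurements
```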

At this point, we point out some interesting consequences of Theorem 11. First of all, $\Phi$ cannot be almost injective if $N < M+1$ since $\operatorname{rank}\Phi_S + \operatorname{rank}\Phi_{S^c} \leq |S| + |S^c| = N$. Also, in the case where $N = M+1$, we note that $\Phi$ is almost injective precisely when $\Phi$ is full spark, that is, every size-$M$ subcollection is a spanning set (note this implies that all of the $\varphi_n$'s are nonzero). In fact, every full spark $\Phi$ with $N \geq M+1$ yields almost injective intensity measurements, which in turn implies that a generic $\Phi$ yields almost injectivity when $N \geq M+1$ [4]. This is in direct analogy with injectivity in the real case; here, injectivity requires $N \geq 2M-1$, injectivity with $N = 2M-1$ is equivalent to being full spark, and being full spark suffices for injectivity whenever $N \geq 2M-1$ [4]. Another thing to check is that the condition for injectivity implies the condition for almost injectivity (it does).

Having established that full spark ensembles of size $N \geq M+1$ yield almost injective intensity measurements, we note that checking whether a matrix is full spark is NP-hard in general [33]. Granted, there are a few explicit constructions of full spark ensembles which can be used [2,36], but it would be nice to have a condition which is not computationally difficult to test in general. We provide one such condition in the following theorem, but first, we briefly review the requisite frame theory.

A frame is an ensemble $\Phi = \{\varphi_n\}_{n=1}^{N} \subseteq \mathbb{R}^M$ together with frame bounds $0 < A \leq B < \infty$ with the property that for every $x \in \mathbb{R}^M$,

$$A\|x\|^2 \leq \sum_{n=1}^{N} |\langle x, \varphi_n\rangle|^2 \leq B\|x\|^2.$$

When $A = B$, the frame is said to be tight, and such frames come with a painless reconstruction formula:

$$x = \frac{1}{A}\sum_{n=1}^{N} \langle x, \varphi_n\rangle\varphi_n.$$

To be clear, the theory of frames originated in the context of infinite-dimensional Hilbert spaces [22,24], and frames have since been studied in finite-dimensional settings, primarily because this is the setting in which they are applied computationally. Of particular interest are so-called unit norm tight frames (UNTFs), which are tight frames whose frame elements have unit norm: $\|\varphi_n\| = 1$ for every $n = 1, \ldots, N$. Such frames are useful in applications; for example, if one encodes a signal $x$ using frame coefficients $\langle x, \varphi_n\rangle$ and transmits these coefficients across a channel, then UNTFs are optimally robust to noise [29] and one erasure [17]. Intuitively, this optimality comes from the fact that frame elements of a UNTF are particularly well-distributed in the unit sphere [7]. Another pleasant feature of UNTFs is that it is straightforward to test whether a given frame is a UNTF: Letting $\Phi = [\varphi_1 \cdots \varphi_N]$ denote an $M \times N$ matrix whose columns are the frame elements, then $\Phi$ is a UNTF precisely when each of the following occurs simultaneously:

(i) the rows have equal norm

(ii) the rows are orthogonal

(iii) the columns have unit norm

(This is a direct consequence of the tight frame's reconstruction formula and the fact that a UNTF has unit-norm frame elements; furthermore, since the columns have unit norm, it is not difficult to see that the rows will necessarily have norm $\sqrt{N/M}$.) In addition to being able to test that an ensemble is a UNTF, various UNTFs can be constructed using spectral tetris [16] (though such frames necessarily have $N \geq 2M$), and every UNTF can be constructed using the recent theory of eigensteps [11,26]. Now that UNTFs have been properly introduced, we relate them to almost injectivity for phase retrieval:

Theorem 12. If $M$ and $N$ are relatively prime, then every unit norm tight frame $\Phi = \{\varphi_n\}_{n=1}^{N} \subseteq \mathbb{R}^M$ yields almost injective intensity measurements.

Proof. Pick a nonempty proper subset $S \subseteq \{1, \ldots, N\}$. By Theorem 11, it suffices to show that $\operatorname{rank}\Phi_S + \operatorname{rank}\Phi_{S^c} > M$, or equivalently, $\operatorname{rank}\Phi_S\Phi_S^* + \operatorname{rank}\Phi_{S^c}\Phi_{S^c}^* > M$. Note that since $\Phi$ is a unit norm tight frame, we also have

$$\Phi_S\Phi_S^* + \Phi_{S^c}\Phi_{S^c}^* = \Phi\Phi^* = \frac{N}{M}I,$$

and so $\Phi_S\Phi_S^*$ and $\Phi_{S^c}\Phi_{S^c}^*$ are simultaneously diagonalizable, i.e., there exists a unitary matrix $U$ and diagonal matrices $D_1$ and $D_2$ such that

$$UD_1U^* + UD_2U^* = \Phi_S\Phi_S^* + \Phi_{S^c}\Phi_{S^c}^* = \frac{N}{M}I.$$

Conjugating by $U^*$, this then implies that $D_1 + D_2 = \frac{N}{M}I$. Let $L_1 \subseteq \{1, \ldots, M\}$ denote the diagonal locations of the nonzero entries in $D_1$, and $L_2 \subseteq \{1, \ldots, M\}$ similarly for $D_2$. To complete the proof, we need to show that $|L_1| + |L_2| > M$ (since $|L_1| + |L_2| = \operatorname{rank}\Phi_S\Phi_S^* + \operatorname{rank}\Phi_{S^c}\Phi_{S^c}^*$). Note that $L_1 \cup L_2 \neq \{1, \ldots, M\}$ would imply that $D_1 + D_2$ has at least one zero in its diagonal, contradicting the fact that $D_1 + D_2$ is a nonzero multiple of the identity; as such, $L_1 \cup L_2 = \{1, \ldots, M\}$ and $|L_1| + |L_2| \geq M$. We claim that this inequality is strict due to the assumption that $M$ and $N$ are relatively prime. To see this, it suffices to show that $L_1 \cap L_2$ is nonempty. Suppose to the contrary that $L_1$ and $L_2$ are disjoint. Then since $D_1 + D_2 = \frac{N}{M}I$, every nonzero entry in $D_1$ must be $N/M$. Since $S$ is a nonempty proper subset of $\{1, \ldots, N\}$, this means that there exists $K \in (0, M)$ such that $D_1$ has $K$ entries which are $N/M$ and $M-K$ which are $0$. Thus,

$$|S| = \operatorname{Tr}[\Phi_S^*\Phi_S] = \operatorname{Tr}[\Phi_S\Phi_S^*] = \operatorname{Tr}[UD_1U^*] = \operatorname{Tr}[D_1] = K(N/M),$$

implying that $N/M = |S|/K$ with $K \neq M$ and $|S| \neq N$. Since this contradicts the assumption that $N/M$ is in lowest form, we have the desired result. □
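Unlike the full spark property, the hypotheses of Theorem 12 are cheap to test. The sketch below checks the three UNTF conditions through the equivalent statement $\Phi\Phi^* = \frac{N}{M}I$ together with unit-norm columns, and then checks coprimality; the function names are hypothetical and the tolerance is an arbitrary choice. Note that Theorem 12 gives a sufficient condition only, so a `False` result does not by itself rule out almost injectivity.

```python
from math import gcd
import numpy as np

def is_untf(Phi, tol=1e-10):
    """True if the columns of the M x N matrix Phi form a unit norm tight frame."""
    M, N = Phi.shape
    unit_columns = np.allclose(np.linalg.norm(Phi, axis=0), 1.0, atol=tol)
    # Equal-norm orthogonal rows of norm sqrt(N/M) <=> Phi Phi^* = (N/M) I.
    tight = np.allclose(Phi @ Phi.conj().T, (N / M) * np.eye(M), atol=tol)
    return bool(unit_columns and tight)

def theorem12_applies(Phi):
    """True if Phi is a UNTF and M, N are relatively prime."""
    M, N = Phi.shape
    return is_untf(Phi) and gcd(M, N) == 1

# Simplex in R^3 (M = 3, N = 4): a UNTF with gcd(3, 4) = 1.
simplex = np.array([[1.0,  1.0, -1.0, -1.0],
                    [1.0, -1.0,  1.0, -1.0],
                    [1.0, -1.0, -1.0,  1.0]]) / np.sqrt(3)

# Two copies of an orthonormal basis of R^2 (M = 2, N = 4): a UNTF, but
# gcd(2, 4) = 2, and the frame is orthogonally partitionable.
doubled_basis = np.array([[1.0, 0.0, 1.0, 0.0],
                          [0.0, 1.0, 0.0, 1.0]])

print(is_untf(simplex), theorem12_applies(simplex))
print(is_untf(doubled_basis), theorem12_applies(doubled_basis))
```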

In general, whether a UNTF $\Phi$ yields almost injective intensity measurements is determined by whether it is orthogonally partitionable: $\Phi$ is orthogonally partitionable if there exists a partition $S \sqcup S^c = \{1, \ldots, N\}$ such that $\operatorname{span}(\Phi_S)$ is orthogonal to $\operatorname{span}(\Phi_{S^c})$. Specifically, a UNTF yields almost injective intensity measurements precisely when it is not orthogonally partitionable. Historically, this property of UNTFs has been pivotal to the understanding of singularities in the algebraic variety of UNTFs [25], and it has also played a key role in solutions to the Paulsen problem [8,15]. However, it is not clear in general how to efficiently test for this property; this is why Theorem 12 is so powerful.

DISTRIBUTION A: Distribution approved for public release.

M. Fickus et al. / Linear Algebra and its Applications 449 (2014) 475-499

Fig. 1. The simplex in $\mathbb{R}^3$. Pointing out of the page is the vector $\frac{1}{\sqrt{3}}(1, 1, 1)$, while the other vectors are the three permutations of $\frac{1}{\sqrt{3}}(1, -1, -1)$. Together, these four vectors form a unit norm tight frame, and since $M = 3$ and $N = 4$ are relatively prime, these yield almost injective intensity measurements in accordance with Theorem 12. For this ensemble, the points $x$ such that $\mathcal{A}^{-1}(\mathcal{A}(x)) \neq \{\pm x\}$ are contained in the three coordinate planes. Above, we depict the intersection between these planes and the unit sphere. According to Theorem 14, performing phase retrieval with simplices such as this is NP-hard.

4.  The  computational  complexity  of  phase  retrieval 

The  previous  section  characterized  the  real  ensembles  which  yield  almost  injective 
intensity  measurements.  The  benefit  of  seeking  almost  injectivity  instead  of  injectivity  is 
that  we  can  get  away  with  much  smaller  ensembles.  For  example,  a  full  spark  ensemble  in 
$\mathbb{R}^M$ of size $M+1$ suffices for almost injectivity (see Fig. 1), while $2M-1$ measurements
are  required  for  injectivity.  In  this  section,  we  demonstrate  that  this  savings  in  the 
number  of  measurements  can  come  at  a  substantial  price  in  computational  requirements 
for  phase  retrieval.  In  particular,  we  consider  the  following  problem: 

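Problem 13. Let $\mathcal{F} = \{\Phi_M\}_{M=2}^{\infty}$ be a family of ensembles $\Phi_M = \{\varphi_{M;n}\}_{n=1}^{N(M)} \subseteq \mathbb{R}^M$, where $N(M) = \operatorname{poly}(M)$. Then ConsistentIntensities[$\mathcal{F}$] is the following problem: Given $M \geq 2$ and a rational sequence $\{b_n\}_{n=1}^{N(M)}$, does there exist $x \in \mathbb{R}^M$ such that $|\langle x, \varphi_{M;n}\rangle| = b_n$ for every $n = 1, \ldots, N(M)$?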

In this section, we will evaluate the computational complexity of ConsistentIntensities[$\mathcal{F}$] for a large class of families of small ensembles $\mathcal{F}$, but first, we briefly review the main concepts involved. Complexity theory is chiefly concerned with complexity classes, which are sets of problems that share certain computational requirements, such as time or space. For example, the complexity class P is the set of problems which can be solved in an amount of time that is bounded by some polynomial of the bit-length of the input. As another example, NP contains all problems for which an affirmative answer comes with a certificate that can be verified in polynomial time; note that P $\subseteq$ NP since for every problem $A \in$ P, one may ignore the certificate and find the affirmative answer in polynomial time. One key tool that is used to evaluate the complexity of a problem is called polynomial-time reduction. This is a polynomial-time algorithm that solves a problem $A$ by exploiting an oracle which solves another problem $B$, indicating that solving $A$ is no harder than solving $B$ (up to polynomial factors in time); if such a reduction exists, we write $A \leq B$. For example, any efficient phase retrieval procedure for $\mathcal{F}$ can be used as a subroutine to solve ConsistentIntensities[$\mathcal{F}$], indicating that phase retrieval for $\mathcal{F}$ is at least as hard as ConsistentIntensities[$\mathcal{F}$]. A problem $B$ is called NP-hard if $B \geq A$ for every problem $A \in$ NP. Note that since $\leq$ is transitive, it suffices to show that $B \geq C$ for some NP-hard problem $C$. Finally, a problem $B$ is called NP-complete if $B \in$ NP is NP-hard; intuitively, NP-complete problems are the hardest of problems in NP. It is an open problem whether P = NP, but inequality is widely believed [20]; note that under this assumption, NP-hard problems have no computationally efficient solution. This provides a proper context for the main result of this section:

Theorem 14. Let $\mathcal{F} = \{\Phi_M\}_{M=2}^{\infty}$ be a family of full spark ensembles $\Phi_M = \{\varphi_{M;n}\}_{n=1}^{M+1} \subseteq \mathbb{R}^M$ with rational entries that can be computed in polynomial time. Then ConsistentIntensities[$\mathcal{F}$] is NP-complete.

Note that since the ensembles $\Phi_M$ are full spark, the existence of a solution to the
phase retrieval problem $|\langle x, \varphi_{M;n} \rangle| = b_n$ for every
$n = 1, \ldots, M+1$ implies uniqueness by Theorem 11. Before proving this theorem, we first
relate it to a previous
hardness result from [37]. Specifically, this result can be restated using the terminology
in this paper as follows: There exists a family $\mathcal{F} = \{\Phi_M\}_{M=2}^{\infty}$ of
ensembles $\Phi_M = \{\varphi_{M;n}\}_{n=1}^{2M} \subseteq \mathbb{C}^M$, each of which
yields almost injective intensity measurements, such that
ConsistentIntensities[$\mathcal{F}$] is NP-complete. Interestingly, these are the smallest
possible almost injective ensembles in the complex case, and we suspect that the result can
be strengthened to the obvious analogy of Theorem 14:

Conjecture 15. Let $\mathcal{F} = \{\Phi_M\}_{M=2}^{\infty}$ be a family of ensembles
$\Phi_M = \{\varphi_{M;n}\}_{n=1}^{2M} \subseteq \mathbb{C}^M$ which yield almost injective
intensity measurements and have complex rational entries that can be computed in polynomial
time. Then ConsistentIntensities[$\mathcal{F}$] is NP-complete.

To prove Theorem 14, we devise a polynomial-time reduction from the following problem,
which is well known to be NP-complete [32]:

Problem 16 (SubsetSum). Given a finite collection of integers A and an integer z, does
there exist a subset $S \subseteq A$ such that $\sum_{a \in S} a = z$?
Proof of Theorem 14. We first show that ConsistentIntensities[$\mathcal{F}$] is in NP. Note
that if there exists an $x \in \mathbb{R}^M$ such that
$|\langle x, \varphi_{M;n} \rangle| = b_n$ for every $n = 1, \ldots, M+1$, then x will have
all rational entries. Indeed, $v := \Phi_M^* x$ has all rational entries, being a signed
version of $\{b_n\}_{n=1}^{M+1}$, and so $x = (\Phi_M \Phi_M^*)^{-1} \Phi_M v$ is also
rational. Thus, we can view x as a certificate of finite bit-length, and for each
$n = 1, \ldots, M+1$, we know that $|\langle x, \varphi_{M;n} \rangle| = b_n$ can be
verified in time which is polynomial in this bit-length, as desired.

Now we show that ConsistentIntensities[$\mathcal{F}$] is NP-hard by reduction from
SubsetSum. To this end, take a finite collection of integers A and an integer z. Set
$M := |A|$ and label the members of A as $\{a_m\}_{m=1}^M$. Let $\Psi$ denote the
$M \times M$ matrix whose columns are the first M members of $\Phi_M$. Since $\Phi_M$ is
full spark, $\Psi$ is invertible and $\Psi^{-1}\Phi_M$ has the form $[I \; w]$, where w has
all nonzero entries; indeed, if the mth entry of w were zero, then
$\Phi_M \setminus \{\varphi_{M;m}\}$ would not span, violating full spark. Now define

$$b_n := \begin{cases} \left| \dfrac{a_n}{w_n} \right| & \text{if } n = 1, \ldots, M, \\[2ex] \left| 2z - \displaystyle\sum_{m=1}^{M} a_m \right| & \text{if } n = M+1. \end{cases} \tag{16}$$


We claim that an oracle for ConsistentIntensities[$\mathcal{F}$] would return "yes" from the
inputs M and $\{b_n\}_{n=1}^{M+1}$ defined above if and only if there exists a subset
$S \subseteq A$ such that $\sum_{a \in S} a = z$, which would complete the reduction.

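To prove our claim, we start with ($\Rightarrow$): Suppose there exists $x \in \mathbb{R}^M$
such that $|\langle x, \varphi_{M;n} \rangle| = b_n$ for every $n = 1, \ldots, M+1$. Then
$y := \Psi^* x$ satisfies $|\langle y, \Psi^{-1}\varphi_{M;n} \rangle| = b_n$ for every
$n = 1, \ldots, M+1$. Since $\Psi^{-1}\Phi_M = [I \; w]$, then by (16), the entries of y
satisfy

$$|y_m| = \left| \frac{a_m}{w_m} \right| \quad \forall m = 1, \ldots, M, \qquad \left| \sum_{m=1}^{M} y_m w_m \right| = \left| 2z - \sum_{m=1}^{M} a_m \right|.$$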

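By the first equation above, there exists a sequence $\{\varepsilon_m\}_{m=1}^M$ of
$\pm 1$'s such that $y_m = \varepsilon_m a_m / w_m$ for every $m = 1, \ldots, M$, and so the
second equation above gives

$$\left| 2z - \sum_{m=1}^{M} a_m \right| = \left| \sum_{m=1}^{M} y_m w_m \right| = \left| \sum_{m=1}^{M} \varepsilon_m a_m \right| = \left| \sum_{\substack{m=1 \\ \varepsilon_m = 1}}^{M} a_m - \sum_{\substack{m=1 \\ \varepsilon_m = -1}}^{M} a_m \right| = \left| 2 \sum_{\substack{m=1 \\ \varepsilon_m = 1}}^{M} a_m - \sum_{m=1}^{M} a_m \right|.$$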

Removing the absolute values, this means the left-hand side above is equal to the
right-hand side, up to a sign factor. At this point, isolating z reveals that
$z = \sum_{m \in S} a_m$, where S is either $\{m : \varepsilon_m = 1\}$ or
$\{m : \varepsilon_m = -1\}$, depending on the sign factor.
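
The two sign cases can be spelled out explicitly. The shorthand $T$ and $P$ below is introduced here only for this worked step and is not part of the original notation:

```latex
% Shorthand (assumed for this step only):
%   T := \sum_{m=1}^{M} a_m,   P := \sum_{\varepsilon_m = 1} a_m,
% so that \sum_{\varepsilon_m = -1} a_m = T - P.
\begin{align*}
\text{sign } {+}: \quad 2z - T &= 2P - T
  &&\Longrightarrow\quad z = P = \sum_{\varepsilon_m = 1} a_m, \\
\text{sign } {-}: \quad 2z - T &= -(2P - T)
  &&\Longrightarrow\quad z = T - P = \sum_{\varepsilon_m = -1} a_m.
\end{align*}
```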

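For ($\Leftarrow$), suppose there is a subset $S \subseteq \{1, \ldots, M\}$ such that
$z = \sum_{m \in S} a_m$. Define $\varepsilon_m := 1$ when $m \in S$ and
$\varepsilon_m := -1$ when $m \notin S$. Then

$$\left| \sum_{m=1}^{M} \varepsilon_m a_m \right| = \left| \sum_{m \in S} a_m - \sum_{m \notin S} a_m \right| = \left| 2 \sum_{m \in S} a_m - \sum_{m=1}^{M} a_m \right| = \left| 2z - \sum_{m=1}^{M} a_m \right|.$$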


By the analysis from the ($\Rightarrow$) direction, taking $y_m := \varepsilon_m a_m / w_m$
for each $m = 1, \ldots, M$ then ensures that
$|\langle y, \Psi^{-1}\varphi_{M;n} \rangle| = b_n$ for every $n = 1, \ldots, M+1$, which in
turn ensures that $x := (\Psi^*)^{-1} y$ satisfies $|\langle x, \varphi_{M;n} \rangle| = b_n$
for every $n = 1, \ldots, M+1$. □
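
The reduction can be checked numerically on a toy instance. The sketch below is illustrative only: it assumes the particular full spark ensemble $[I \; w]$ (so that $\Psi = I$ and $x = y$), uses exact rational arithmetic, and stands in for the oracle with a brute-force search over sign patterns, which takes exponential time. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def reduce_subset_sum(A, z, w):
    """Build the intensities b_1, ..., b_{M+1} of (16) for the ensemble [I w]."""
    b = [abs(Fraction(a) / wm) for a, wm in zip(A, w)]
    b.append(abs(Fraction(2 * z - sum(A))))
    return b

def consistent_intensities(b, w):
    """Brute-force stand-in for the oracle: is there x with |<x, phi_n>| = b_n?"""
    M = len(w)
    for signs in product((1, -1), repeat=M):
        # |<x, e_m>| = b_m forces x_m = +/- b_m; try every sign pattern.
        x = [s * bm for s, bm in zip(signs, b[:M])]
        if abs(sum(xm * wm for xm, wm in zip(x, w))) == b[M]:
            return True
    return False

def subset_sum(A, z):
    """Brute-force SubsetSum, used only to cross-check the reduction."""
    return any(sum(a for a, keep in zip(A, mask) if keep) == z
               for mask in product((0, 1), repeat=len(A)))

A = [3, -7, 12, 5]
# Any w with all nonzero entries makes [I w] full spark.
w = [Fraction(1), Fraction(-2), Fraction(1, 3), Fraction(5)]
results = {z: consistent_intensities(reduce_subset_sum(A, z, w), w) for z in (8, 2)}
```

Here `results[8]` corresponds to the subset {3, 5}, while no subset of `A` sums to 2.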


Based on Theorem 14, there is no polynomial-time algorithm to perform phase retrieval for
minimal almost injective ensembles, assuming $P \ne NP$. On the other hand, there exist
ensembles of size $2M - 1$ for which phase retrieval is particularly efficient. For example,
letting $\delta_{M;m} \in \mathbb{R}^M$ denote the mth identity basis element, consider the
ensemble $\Phi_M := \{\delta_{M;m}\}_{m=1}^{M} \cup \{\delta_{M;1} + \delta_{M;m}\}_{m=2}^{M}$;
then one can reconstruct (up to global phase) any x whose first entry is nonzero by first
taking $\hat{x}[1] := |\langle x, \delta_{M;1} \rangle|$, and then taking

$$\hat{x}[m] := \frac{1}{2\hat{x}[1]} \left( |\langle x, \delta_{M;1} + \delta_{M;m} \rangle|^2 - |\langle x, \delta_{M;1} \rangle|^2 - |\langle x, \delta_{M;m} \rangle|^2 \right) \qquad \forall m = 2, \ldots, M.$$
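
As a quick sanity check of the $2M - 1$ measurement scheme just described, here is a minimal Python sketch for the real case. The function names `measure` and `reconstruct` are illustrative choices, and the recovered vector agrees with x only up to the global sign of its first entry.

```python
def measure(x):
    """Intensities for the ensemble {delta_m} U {delta_1 + delta_m : m >= 2}."""
    M = len(x)
    basis = [abs(x[m]) for m in range(M)]            # |<x, delta_m>|
    cross = [abs(x[0] + x[m]) for m in range(1, M)]  # |<x, delta_1 + delta_m>|
    return basis, cross

def reconstruct(basis, cross):
    """Recover x up to a global sign; assumes basis[0] != 0."""
    x1 = basis[0]
    xhat = [x1]
    for m in range(1, len(basis)):
        # Polarization: x_1 * x_m = (|x_1 + x_m|^2 - |x_1|^2 - |x_m|^2) / 2.
        xhat.append((cross[m - 1] ** 2 - x1 ** 2 - basis[m] ** 2) / (2 * x1))
    return xhat

x = [-2.0, 3.0, 0.5, -1.25]
xhat = reconstruct(*measure(x))  # equals -x here, since x[0] < 0
```

The reconstruction costs O(M) arithmetic operations, in contrast with the hardness of the minimal $M + 1$ case.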


Intuitively, we expect a redundancy threshold that determines whether phase retrieval
can be efficient, and this suggests the following open problem: What is the smallest C
for which there exists a family of ensembles of size $N = CM + o(M)$ such that phase
retrieval can be performed in polynomial time?
We present a new deterministic algorithm for the sparse Fourier transform problem,
in which we seek to identify $k \ll N$ significant Fourier coefficients from a signal of
bandwidth N. Previous deterministic algorithms exhibit quadratic runtime scaling, while
our algorithm scales linearly with k in the average case. Underlying our algorithm are
a few simple observations relating the Fourier coefficients of time-shifted samples to
unshifted samples of the input function. This allows us to detect when aliasing between
two or more frequencies has occurred, as well as to determine the value of unaliased
frequencies. We show empirically that our algorithm is orders of magnitude faster than
competing algorithms.

Keywords :  Fourier  algorithm. 


1.  Introduction 

The Fast Fourier Transform (FFT) is arguably the most ubiquitous numerical algorithm
in scientific computing. In addition to being named one of the "Top Ten Algorithms" of
the past century [Dongarra and Sullivan (2000)], the FFT is a critical tool in myriad
applications, ranging from signal processing to computational PDE and machine learning.
At the time of its introduction, it represented a major leap forward in the size of
problems that could be solved on available hardware, as it reduces the runtime complexity
of computing the Discrete Fourier Transform (DFT) of a length-N array from $O(N^2)$ to
$O(N \log N)$.

Any algorithm which computes all N Fourier coefficients has a runtime complexity of
$\Omega(N)$, since it takes that much time merely to report the output. However, in many
applications it is known that the DFT of the signal of interest is highly sparse - that
is, only a small number of coefficients are non-zero. In this case, it is possible to
break the $\Omega(N)$ barrier by asking only for the largest k terms in the signal's DFT.
When $k \ll N$, existing algorithms can significantly outperform even highly optimized FFT
implementations [Iwen et al. (2007); Iwen (2010); Hassanieh et al. (2012b)].


1.1.  Related  work 

The  first  works  to  implicitly  address  the  sparse  approximate  DFT  problem  appeared 
in  the  theoretical  computer  science  literature  in  the  early  1990s.  In  Linial  et  al. 
[1993],  a  variant  of  the  Fourier  transform  for  Boolean  functions  was  shown  to  have 
applications  for  learnability.  A  polynomial-time  algorithm  to  find  large  coefficients 
in  this  basis  was  given  in  Kushilevitz  and  Mansour  [1993],  while  the  interpolation 
of  sparse  polynomials  over  finite  fields  was  considered  in  Mansour  [1995].  It  was 
later  realized  [Gilbert  et  al.  (2005)]  that  this  last  algorithm  could  be  considered  as 
an  approximate  DFT  for  the  special  case  when  N  is  a  power  of  two. 

In  the  past  10  or  so  years,  a  number  of  algorithms  have  appeared  which  directly 
address  the  problem  of  computing  sparse  approximate  Fourier  transforms.  When 
comparing  the  results  in  the  literature,  care  must  be  taken  to  identify  the  class 
of  signals  over  which  a  specific  algorithm  is  to  perform,  as  well  as  to  identify  the 
error  bounds  of  a  given  method.  Different  algorithms  have  been  devised  in  different 
research  communities,  and  so  have  varying  assumptions  on  the  underlying  signals 
as  well  as  different  levels  of  acceptable  error. 

The first result with sub-linear runtime and sampling requirements appeared in Gilbert
et al. [2002]. They give a poly($k$, $\log N$, $\log(1/\delta)$, $1/\varepsilon$) time
algorithm for finding, with probability $1 - \delta$, an approximation y of the DFT
$\hat{x}$ of the input x that is nearly optimal, in the sense that
$\|\hat{x} - y\|_2^2 \le (1 + \varepsilon)\|\hat{x} - \hat{x}_{opt}\|_2^2$, where
$\hat{x}_{opt}$ is the best k-term approximation to $\hat{x}$. Here, the exponent of k in
the runtime is two, so the algorithm is quadratic in the sparsity. Moreover, the algorithm
is non-adaptive in the sense that the samples used are independent of the input x. This
algorithm was modified in Gilbert et al. [2005] to bring the dependence on k down to
linear.^a This was accomplished mainly by replacing uniform random variables (used to
sample the input) by random arithmetic progressions, which allowed the use of
non-equispaced FFTs to sample from intermediate representations and to estimate the
coefficients in near-linear time. The increased overhead of this procedure, however,
limited the range of k for which the algorithm outperformed a standard FFT implementation
[Iwen et al. (2007)].


^a See Gilbert et al. [2008] for a "user-friendly" description of the improved algorithm.
Around the same time, a similar algorithm was developed in the context of list decoding
for proving hard-core predicates for one-way functions [Akavia et al. (2003)]. This can be
considered an extension of Kushilevitz and Mansour [1993], and like Gilbert et al.
[2002, 2005] is a randomized algorithm. Since the goal in this work was to give a
polynomial-time algorithm for list decoding, no effort was made to optimize the dependence
on k; it stands at $k^{11/2}$, considerably higher than Gilbert et al. [2002, 2005]. The
randomness in this algorithm is used only to construct a sample set on which norms are
estimated, and in Akavia [2010] this set is replaced with a deterministic construction.
This construction is based on the notion of $\varepsilon$-approximating the uniform
distribution over arithmetic progressions, and relies on existing constructions of
$\varepsilon$-biased sets of small size [Katz (1989); Ajtai et al. (1990)]. Depending on
the size of the $\varepsilon$-biased sets used, the sampling and runtime complexities are
$O(k^4 \log^c N)$ and $O(k^6 \log^c N)$, respectively, for some $c > 4$.^b

In the series of works [Iwen (2008, 2010, 2012)], a different deterministic algorithm
for sparse Fourier approximation was given that relies on the combinatorial properties of
aliasing, or collisions among frequencies in sub-sampled DFTs. By taking enough short DFTs
of co-prime lengths, and employing the Chinese Remainder Theorem (CRT) to reconstruct
energetic frequencies from their residues modulo these sample lengths, the author is able
to prove sampling and runtime bounds of $O(k^2 \log^4 N)$. The error bound is of the form
$\|\hat{x} - y\|_2 \le \|\hat{x} - \hat{x}_{opt}\|_2 + k^{-1/2}\|\hat{x} - \hat{x}_{opt}\|_1$;
it has been shown that the stronger "$\ell_2$-$\ell_2$" guarantee of Gilbert et al. [2005]
cannot hold for a sub-linear, deterministic algorithm [Cohen et al. (2009)]. Moreover, the
range of k for which this algorithm is faster than the FFT is smaller in practice than
that of Gilbert et al. [2005].

Most recently, the authors of Hassanieh et al. [2012b] presented a randomized algorithm
that extends by an order of magnitude the range of sparsity for which it is faster than
the FFT. This is accomplished by removing the iterative aspect from Gilbert et al. [2005]
by using more efficient filters, which are nearly flat within the passband and which decay
exponentially outside. In contrast, the box-car filters used in Gilbert et al. [2005] have
a frequency response which oscillates and decays like $|\omega|^{-1}$. In addition, the
identification of significant frequencies is done by direct estimation after hashing into
a large number of bins rather than the binary search technique of Gilbert et al. [2005].
These changes give a runtime bound of $O(\log N \sqrt{N k \log N})$ and a somewhat stronger
error bound
$\|\hat{x} - y\|_\infty^2 \le \varepsilon k^{-1} \|\hat{x} - \hat{x}_{opt}\|_2^2 + \delta \|x\|_1^2$
with probability $1 - 1/N$, where $\varepsilon > 0$ and $\delta = N^{-O(1)}$ is a precision
parameter.


^b Specifically, the runtime is $O(k^2 \cdot \log N \cdot |S|)$, where S is the set of
samples read by the algorithm. This set takes the form
$S = \bigcup_{\ell=1}^{\lfloor \log N \rfloor} (A - B_\ell)$, where A has
$\varepsilon$-discrepancy on rank 2 Bohr sets, $B_\ell$ $\varepsilon$-approximates the
uniform distribution on $[0, 2^\ell - 1] \cap \mathbb{Z}$, and $A - B_\ell$ is the
difference set. Using constructions from Katz [1989] one has
$|A| = O(\varepsilon^{-1} \log^4 N)$, $|B_\ell| = O(\varepsilon^{-3} \log^4 N)$; setting
$\varepsilon = \Theta(k^{-1})$ and noting that
$|\bigcup_\ell (A - B_\ell)| = O(\sum_\ell |A - B_\ell|)$ and
$|A - B_\ell| = O(|A| \, |B_\ell|)$ [see, e.g. Tao and Vu (2006)] one obtains the stated
sampling and runtime complexities.
These existing algorithms generally take one of two approaches to the sparse Fourier
transform problem. In Gilbert et al. [2002], Akavia et al. [2003], Gilbert et al. [2005]
and Hassanieh et al. [2012b], the spectrum of the input is randomly permuted and then run
through a low-pass filter to isolate and identify frequencies which carry a large fraction
of the signal's energy. This leads to randomized algorithms that fail on a non-negligible
set of possible inputs. On the other hand, Iwen [2010] takes advantage of the
combinatorial properties of aliasing in order to identify the significant frequencies.
This leads to a deterministic algorithm with higher runtime and sampling requirements than
the randomized algorithms mentioned. Both the randomized and the deterministic approaches
have drawbacks. Randomized algorithms are not suitable for failure-intolerant
applications, while the process used to reconstruct significant frequencies in Iwen [2010]
relies on the CRT, which is highly unstable to errors in the residues. While there do
exist algorithms for "noisy Chinese Remaindering" [Goldreich et al. (2000); Boneh (2002);
Shparlinski and Steinfeld (2004)], these have thus far not found application to the sparse
DFT problem, and we leave this as future work.

As  this  paper  was  being  prepared,  the  authors  became  aware  of  an  indepen¬ 
dent  work  using  very  similar  methods  for  frequency  estimation  in  the  noiseless 
case  [Hassanieh  et  al.  (2012a)].  Both  methods  consider  the  phase  difference  between 
Fourier  samples  to  extract  frequency  information,  but  are  based  on  different  tech¬ 
niques  for  binning  significant  frequencies.  The  authors  of  Hassanieh  et  al.  [2012a] 
use  random  dilations  and  efficient  filters  of  Hassanieh  et  al.  [2012b],  whereas  we 
use  different  sample  lengths  in  the  spirit  of  Iwen  [2010].  We  believe  both  contri¬ 
butions  are  of  interest,  and  reinforce  the  notion  that  exploiting  phase  information 
is  critical  for  developing  fast,  robust  algorithms  for  the  sparse  Fourier  transform 
problem. 


1.2.  Relationship  to  compressed  sensing 

The  term  “compressed  sensing”  refers  to  a  new  paradigm  in  signal  processing  which 
seeks  to  recover  a  compressible  signal  from  a  number  of  linear  measurements  roughly 
proportional  to  its  information  content,  rather  than  its  nominal  dimension.  While 
this  paper  does  not  make  explicit  use  of  the  results  or  algorithms  of  compressed 
sensing,  there  are  parallels  in  the  approaches  used.  The  purpose  of  this  section  is 
to  clarify  the  relationship  between  the  two. 

All algorithms for the sparse Fourier transform problem take a small number of samples of
the input x, either at random or in a deterministic fashion. These samples are then
processed in a highly non-linear, algorithm-dependent manner to produce a k-term Fourier
representation of $\hat{x}$ - that is, a list $\{(\omega_j, a_j)\}_{j=1}^{k}$ of
significant frequency/coefficient pairs. In other words, these algorithms approximately
solve the severely underdetermined system $R F^* \hat{x} = R x$, where R is the restriction
to the samples used by the algorithm, $F^*$ is the adjoint of the $N \times N$ discrete
Fourier matrix with entries

$$F_{jk} = \frac{1}{\sqrt{N}} e^{-2\pi i jk/N}, \qquad 0 \le j, k < N \tag{1}$$

and $\hat{x}$ is the DFT of x.

In Candes et al. [2006], Candes, Romberg, and Tao considered the dual problem, that of
recovering a given signal from highly incomplete Fourier measurements. Specifically,
suppose that a signal x of length N is the superposition of k spikes at times
$t = \tau_j$:

$$x[t] = \sum_{j=1}^{k} a_j \, \delta(t - \tau_j). \tag{2}$$

The authors show that, with high probability, x can be recovered exactly from a randomly
chosen set $\Omega$ of m frequencies from the DFT of x, provided

$$m \ge C k \log N \tag{3}$$

for some constant C whose value depends on the desired probability of success. This can be
viewed as the severely underdetermined linear system dual to the system described above:
$R F x = R \hat{x}$. The recovery algorithm in this case is the $\ell_1$ minimization

$$g^* = \operatorname{argmin} \|g\|_1 \quad \text{subject to} \quad \hat{g}(\omega) = \hat{f}(\omega) \text{ for all } \omega \in \Omega. \tag{4}$$

The idea of using $\ell_1$ minimization to recover sparse vectors has been studied
extensively in a number of research communities, including seismic imaging [Santosa and
Symes (1986)], image processing [Rudin et al. (1992)], and signal processing [Chen et al.
(1998)], where it is commonly referred to as basis pursuit. The theoretical foundations of
$\ell_1$ approximation are treated in depth in the monograph [Pinkus (1989)].

Other sampling schemes and recovery algorithms have been studied for the compressed
sensing problem. For example, in Rauhut [2007], $\ell_1$ minimization is used with points
sampled randomly from a continuous distribution, while in Xu [2011] a deterministic
sampling scheme is analyzed with reconstruction through Orthogonal Matching Pursuit (OMP).
Other works which analyze the performance of OMP in the compressed sensing setting include
Tropp and Gilbert [2007] and Kunis and Rauhut [2008].

Sparse  Fourier  approximation  and  compressed  sensing  are  therefore  broadly  sim¬ 
ilar  in  both  their  goals  (sparse  approximation  of  signals)  and  methods  (in  particular, 
the  use  of  randomization.)  There  are,  however,  substantial  differences  between  the 
two,  which  we  now  enumerate. 

(1) Sampling requirements. The compressed sensing model requires measurement matrices to
satisfy the Restricted Isometry Property, which has been shown to hold with high
probability for random Gaussian, Bernoulli, and Fourier ensembles. Sparse Fourier
algorithms, on the other hand, generally require more structure in their sampling sets.
This is obviously true for the deterministic algorithms, and also for some of the
randomized versions - in particular, [Gilbert et al. (2005)] requires samples that lie on
arithmetic progressions.

(2) Reconstruction costs. As mentioned above, the reconstruction of the target signal in
the compressed sensing model is achieved by a convex optimization problem (which can be
recast as a linear program). This is expected to incur a computational cost of $O(N^3)$
for a signal of length N. Most sparse Fourier transform algorithms have time complexity
that is polylogarithmic in N, and so are exponentially faster.

(3) Allocation of resources. The balance between the two previous items is the major point
of distinction. Indeed, we view the comparison of the two paradigms as an
"apples-to-oranges" scenario: In the seismic imaging environment (where practitioners have
recently implemented compressed sensing methods [Lin and Herrmann (2007); Demanet and
Peyre (2011)]), high acquisition costs make long processing times on the back end more
palatable. Sparse Fourier transform algorithms, however, were developed with data
streaming applications in mind. In this area, low signal acquisition costs and enormous
problem sizes necessitate fast algorithms with sparing use of memory resources.

D. Lawlor, Y. Wang & A. Christlieb, Adv. Adapt. Data Anal. 2013.05.

1.3.  New  results 

In  this  paper  we  describe  a  simple,  deterministic  algorithm  that  avoids  reconstruc¬ 
tion  with  the  CRT.  We  are  thus  able  to  avoid  two  pitfalls  associated  with  exist¬ 
ing  algorithms.  Our  method  relies  on  sampling  the  signal  in  the  time  domain  at 
slightly  shifted  points,  and  thus  it  assumes  access  to  an  underlying  continuous¬ 
time  signal.  The  shifted  time  samples  allow  us  to  determine  the  value  of  significant 
frequencies  in  sub-sampled  FFTs  and  also  indicate  when  two  or  more  frequen¬ 
cies  have  been  aliased  in  such  a  sub-sampled  FFT.  These  two  key  facts  allow  us 
to  significantly  reduce  (by  up  to  two  orders  of  magnitude)  the  average-case  sam¬ 
pling  and  runtime  complexity  of  the  sparse  FFT  over  a  certain  class  of  random 
signals.  Our  worst-case  bounds  improve  by  a  constant  factor  those  of  prior  deter¬ 
ministic  algorithms.  We  present  both  adaptive  and  non-adaptive  versions  of  our 
algorithms.  If  the  application  allows  samples  to  be  acquired  adaptively  (that  is, 
dependent  on  previous  samples),  we  are  able  to  improve  further  on  our  average-case 
bounds. 

The remainder of this paper is organized as follows. In Sec. 2, we introduce notation and
prove the technical lemmas underlying our algorithms. In Sec. 3, we introduce randomized
and deterministic versions of our algorithm. In Sec. 4, we prove that our algorithm has
average-case runtime and sampling complexities of $O(k \log k)$ and $O(k)$, respectively.
In Sec. 5, we present the results of an empirical evaluation of our algorithm and compare
its runtime and sampling requirements to competing algorithms. Finally, in Sec. 6, we
provide some concluding remarks and discuss ongoing work to appear in the future.

2.  Mathematical  Background

2.1.  Preliminaries

Throughout this work we shall be concerned with frequency-sparse band-limited signals
$S : [0, 1) \to \mathbb{C}$ of the form

$$S(t) = \sum_{j=1}^{k} a_j e^{2\pi i \omega_j t}, \tag{5}$$

where $\omega_j \in [-N/2, N/2) \cap \mathbb{Z}$, $a_j \in \mathbb{C}$, and $k \ll N$. The
Fourier series of S is given by

$$\hat{S}(\omega) = \int_0^1 S(t) e^{-2\pi i \omega t} \, dt, \qquad \omega \in \mathbb{Z}, \tag{6}$$

so that for signals of the form (5) we have $\hat{S}(\omega_j) = a_j$ and
$\hat{S}(\omega) = 0$ for all other $\omega \in [-N/2, N/2) \cap \mathbb{Z}$. Given any
finite sequence $S = (s_0, s_1, \ldots, s_{p-1})$ of length p we define its DFT by

$$\hat{S}[h] = \sum_{j=0}^{p-1} s_j W_p^{jh} = \sum_{j=0}^{p-1} S[j] \, W_p^{jh}, \tag{7}$$

where $h = 0, 1, \ldots, p-1$, $S[j] := s_j$ and $W_p := e^{-2\pi i / p}$ is the primitive
pth root of unity. The FFT allows the computation of $\hat{S}$ in $O(p \log p)$ steps.

We apply the DFT to discrete samples of S(t) to compute the Fourier coefficients $a_j$ of
S(t). For an integer p and real $\varepsilon > 0$ we form discrete arrays of samples of S
of length p via

$$S_p[j] = S\left(\frac{j}{p}\right), \qquad S_{p,\varepsilon}[j] = S\left(\frac{j}{p} + \varepsilon\right), \qquad j = 0, 1, \ldots, p-1.$$

Now assume that all $\omega_j \pmod{p}$, $1 \le j \le k$, are distinct. It is a simple
derivation to obtain

$$\hat{S}_p[h] = \begin{cases} p \, a_j & h \equiv \omega_j \pmod{p} \\ 0 & \text{otherwise.} \end{cases}$$

By examining the peaks of $\hat{S}_p[h]$ we will be able to determine
$\{\omega_j \pmod{p} : 1 \le j \le k\}$. Previous approaches applied the CRT to reconstruct
$\{\omega_j\}$ by taking a suitable number of p's, which must overcome the problem of
registrations to match up each $\omega_j$ whenever a new p is used [see, e.g. Iwen (2010,
2012)]. Our algorithm takes a different approach using the shifted sub-samples. Note that

$$\hat{S}_{p,\varepsilon}[h] = \begin{cases} p \, a_j e^{2\pi i \varepsilon \omega_j} & h \equiv \omega_j \pmod{p} \\ 0 & \text{otherwise.} \end{cases}$$
It follows that in this setting, for $h \equiv \omega_j \pmod{p}$ we have
$\hat{S}_{p,\varepsilon}[h] / \hat{S}_p[h] = e^{2\pi i \varepsilon \omega_j}$. Hence

$$2\pi \varepsilon \omega_j \equiv \operatorname{Arg}\left( \frac{\hat{S}_{p,\varepsilon}[h]}{\hat{S}_p[h]} \right) \pmod{2\pi}, \tag{8}$$

where Arg(z) denotes the phase angle of the complex number z in $[-\pi, \pi)$. Assume that
we take $|\varepsilon| \le 1/N$. Then $\omega_j$ is completely determined by (8) as there
will be no wrap-around aliasing, and

$$\omega_j = \frac{1}{2\pi\varepsilon} \operatorname{Arg}\left( \frac{\hat{S}_{p,\varepsilon}[h]}{\hat{S}_p[h]} \right). \tag{9}$$

In fact, more generally, if we have an estimate of $\omega_j$, say $|\omega_j| < L/2$, then
by taking $|\varepsilon| \le 1/L$ the same reconstruction formula (9) holds. Note that even
though the denominator of (9) contains a very small number $\varepsilon$, it can be
verified through Taylor expansion that the numerator is of the same order, so that the
ratio is well-behaved in the noiseless case, at least for $\omega$ sufficiently far from
$\pm\pi$. The observation that taking slightly shifted samples allows us to identify
frequencies in S(t) underlies the algorithms which follow, and the bulk of this paper
analyzes various aspects of the proposed algorithms, such as efficiency and robustness.
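
To make the shifted-sample idea concrete, the following Python sketch recovers the frequencies and coefficients of a small noiseless test signal from one unshifted and one shifted sub-sampled DFT, using formula (9). It is a minimal illustration, not the paper's algorithm: the test signal, the choice p = 11, the shift $\varepsilon = 1/(2N)$ (chosen to stay strictly inside the no-wrap-around range), the peak threshold, and the naive $O(p^2)$ DFT are all assumptions made here for brevity, and the frequencies are chosen to be distinct modulo p.

```python
import cmath
import math

def signal(t, freqs, coeffs):
    """Evaluate S(t) = sum_j a_j * exp(2*pi*i*omega_j*t), as in (5)."""
    return sum(a * cmath.exp(2j * math.pi * w * t) for w, a in zip(freqs, coeffs))

def dft(samples):
    """Naive O(p^2) DFT with W_p = exp(-2*pi*i/p), as in (7)."""
    p = len(samples)
    return [sum(s * cmath.exp(-2j * math.pi * j * h / p) for j, s in enumerate(samples))
            for h in range(p)]

def recover(freqs, coeffs, p, N):
    """Return {omega_j: a_j}, assuming the omega_j are distinct modulo p."""
    eps = 1.0 / (2 * N)  # assumed shift; keeps |2*pi*eps*omega| below pi
    base = dft([signal(j / p, freqs, coeffs) for j in range(p)])
    shifted = dft([signal(j / p + eps, freqs, coeffs) for j in range(p)])
    found = {}
    for h in range(p):
        if abs(base[h]) > 1e-6 * p:  # ad hoc threshold: skip empty bins
            omega = cmath.phase(shifted[h] / base[h]) / (2 * math.pi * eps)
            found[round(omega)] = base[h] / p  # peak height is p * a_j
    return found

N, p = 1024, 11
freqs = [37, -200, 410]      # residues mod 11 are 4, 9, 3 (no collisions)
coeffs = [1.0, 0.5j, -2.0]
found = recover(freqs, coeffs, p, N)
```

Only 2p = 22 samples of S are read, even though the bandwidth is N = 1024.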

One of the problems is that when $p < N$, it is possible that two or more distinct
frequencies will have the same remainder modulo p. In this case, we say the frequencies
are aliased or collide (mod p). In general, for $h \in \{0, \ldots, p-1\}$ and the given
signal S(t) let $I(S, h; p) := \{j : \omega_j \equiv h \pmod{p}\}$. Then we have

$$\hat{S}_p[h] = p \sum_{\omega \equiv h \ (\mathrm{mod}\ p)} \hat{S}(\omega) = p \sum_{j \in I(S,h;p)} a_j. \tag{10}$$

When aliasing occurs, reconstruction via (9) is no longer valid. The aliasing phenomenon
presents a serious challenge for any method with sub-linear sampling complexity. In the
next section, we develop a simple test to determine whether or not aliasing has occurred
in a p-length DFT, which then allows us to effectively overcome this challenge and develop
provably correct sub-linear algorithms.


2.2.  Technical  lemmas 

To effectively apply the sub-sampling idea in a Fourier algorithm, one must first overcome
the aliasing challenge. Using shifted sub-samples gives us a simple yet extremely
effective criterion to determine whether or not aliasing has occurred at a given location
in a p-length DFT without resorting to complicated combinatorial techniques. Observe that
complementing (10) we have

$$\hat{S}_{p,\varepsilon}[h] = p \sum_{j \in I(S,h;p)} a_j e^{2\pi i \varepsilon \omega_j}. \tag{11}$$

It follows that

$$|\hat{S}_{p,\varepsilon}[h]|^2 - |\hat{S}_p[h]|^2 = p^2 \left( \sum_{j,l \in I(S,h;p)} a_j \overline{a_l} \, e^{2\pi i \varepsilon (\omega_j - \omega_l)} - \Big| \sum_{j \in I(S,h;p)} a_j \Big|^2 \right). \tag{12}$$

Lemma 1. Let $p > 1$ and $h \in \{0, 1, \ldots, p-1\}$. Assume that
$q = |I(S, h; p)| > 1$, i.e. $\omega_j \equiv h \pmod{p}$ for more than one j in S(t). Then
we have the following:


(A) Let $\varepsilon > 0$ and $E := \{\omega_j - \omega_m : j, m \in I(S, h; p)\}$. Suppose
that all elements of $\varepsilon E$ are distinct (mod 1). Then
$|\hat{S}_{p,m\varepsilon}[h]| \ne |\hat{S}_p[h]|$ for some $1 \le m \le q^2 - q$.

(B) For almost all $\varepsilon > 0$ we have
$|\hat{S}_{p,\varepsilon}[h]| \ne |\hat{S}_p[h]|$.


Proof. The proof of part (B) is immediate from (12). Observe that
$f(\varepsilon) := |\hat{S}_{p,\varepsilon}[h]|^2 - |\hat{S}_p[h]|^2$ is a trigonometric
polynomial in $\varepsilon$, and it is not identically 0 given that
$q = |I(S, h; p)| > 1$. Thus, it has at most finitely many zeros for
$\varepsilon \in [0, 1)$, and hence (B) is clearly true.

We resort to the Vandermonde matrix to prove part (A). For simplicity we write
$f(t) = \sum_{\alpha \in E} c_\alpha e^{2\pi i \alpha t}$. Set
$r_\alpha := e^{2\pi i \alpha \varepsilon}$ where $\varepsilon$ satisfies the hypothesis of
the lemma, which implies that all $r_\alpha$ are distinct. Assume the claim of part (A) is
false. Then we have $f(m\varepsilon) = 0$ for all $0 \le m \le q^2 - q$. Here, $f(0) = 0$
is automatic because $S_{p,0} = S_p$. Thus we have

$$\sum_{\alpha \in E} c_\alpha r_\alpha^m = 0, \qquad m = 0, 1, \ldots, q^2 - q. \tag{13}$$

But the cardinality of E is at most $q^2 - q + 1$, which means that there are at most
$q^2 - q + 1$ terms in the sum in (13). Because all $r_\alpha$ are distinct the matrix
$[r_\alpha^m]$ is a non-singular Vandermonde matrix, and for (13) to hold all $c_\alpha$
must be zero. This is clearly not the case, which is a contradiction. □


Remark. Any irrational $\varepsilon$ or $\varepsilon = a/b$ with a, b coprime and
$b > 2N$ will satisfy the hypothesis of part (A) of Lemma 1. It is also easy to show that
in the special case where all coefficients $a_j$ are real and $|I(S, h; p)| = 2$, we have
$|\hat{S}_{p,\varepsilon}[h]| \ne |\hat{S}_p[h]|$ for any $\varepsilon = a/b$ with a, b
coprime and $b > N$.

Lemma 1 allows us to determine whether aliasing has occurred by checking whether
$|\widehat{S}_{p,\varepsilon}[h]| / |\widehat{S}_p[h]| = 1$ for a few values of $\varepsilon$. It offers both a deterministic (part (B))
and a random (part (A)) procedure to identify aliasing in the sub-sampled DFTs.
In practice, we need to set a tolerance $\tau$ in order to accept or reject frequencies
according to the criterion

    $\left| \frac{|\widehat{S}_{p,\varepsilon}[h]|}{|\widehat{S}_p[h]|} - 1 \right| \le \tau.$    (14)

We typically choose $\varepsilon = 1/(cN)$ for some small constant $c > 2$, which would satisfy
the hypothesis of part (A) of Lemma 1. A tolerance on the order of $p/N$ works well
in general, which is what we use in our experiments in Sec. 5 below.
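The aliasing test can be illustrated with a small sketch. It assumes a DFT normalised by $1/p$ with kernel $e^{-2\pi i h j/p}$, so that each bin holds the sum of the coefficients of the frequencies congruent to $h$ modulo $p$; the function names `sub_dft` and `is_aliased` and the toy signal are illustrative choices, not part of the original implementation.

```python
import cmath

def sub_dft(S, p, eps=0.0):
    """DFT of the p samples S(j/p + eps), j = 0..p-1, normalised by 1/p."""
    samples = [S(j / p + eps) for j in range(p)]
    return [
        sum(s * cmath.exp(-2j * cmath.pi * h * j / p) for j, s in enumerate(samples)) / p
        for h in range(p)
    ]

def is_aliased(unshifted, shifted, tau):
    """Flag a bin whose shifted/unshifted magnitude ratio differs from 1 by more than tau."""
    return abs(abs(shifted) / abs(unshifted) - 1.0) > tau

# Toy signal (assumed for illustration): frequencies 3 and 10 collide modulo
# p = 7, while frequency 5 is isolated.
N, p = 128, 7
eps = 1.0 / (2 * N)
terms = {3: 1.0, 10: 1j, 5: 2.0}

def S(t):
    return sum(a * cmath.exp(2j * cmath.pi * w * t) for w, a in terms.items())

plain, moved = sub_dft(S, p), sub_dft(S, p, eps)
# Only the two occupied bins are tested; empty bins would divide by ~0.
flags = {h: is_aliased(plain[h], moved[h], tau=p / N) for h in (3, 5)}
print(flags)
```

In this toy case the shift changes the magnitude of the bin that holds two frequencies, while the magnitude of the isolated bin is unchanged up to rounding error.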
In our algorithms, we will take a number of sub-sampled DFTs of an input signal
$S(t)$ of the form (5), whose lengths we denote $p_\ell$. Lemma 1 allows us to determine
whether or not two or more frequencies are aliased, so that we only add the
non-aliased terms to our representation. Since it is unlikely that two or more frequencies
are aliased modulo two different sampling rates, using a different $p_\ell$ in a subsequent
iteration lets us quickly discover all frequencies present in $S(t)$. Lemma 2 gives a
worst-case bound on the number of $p_\ell$'s required by our deterministic algorithm
to identify all $k$ frequencies in a given Fourier-sparse signal. It is similar to Iwen
[2010, Lemma 1], but with a smaller constant. In its proof we use the CRT, which
we quote here for completeness [see, e.g. Niven et al. (1991)].

Theorem 1 (Chinese Remainder Theorem). Any integer $n$ is uniquely specified
modulo $N$ by its remainders modulo $m$ pairwise relatively prime numbers $p_\ell$,
provided $\prod_{\ell=1}^{m} p_\ell \ge N$.

Lemma 2. Let $M > 1$. It suffices to take $1 + (k - 1)\lfloor \log_M N \rfloor$ pairwise relatively
prime $p_\ell$'s with $p_\ell > M$ to ensure that each frequency $\omega_j$ is isolated (i.e. not aliased)
(mod $p_\ell$) for at least one $\ell$.

Proof. Assume otherwise, namely that given $p_\ell$ for $\ell = 1, 2, \ldots, L$ with $L >
k\lfloor \log_M N \rfloor$ there exists some $\omega_j$ such that $\omega_j$ is aliased (mod $p_\ell$) for every $\ell$. By the
Pigeonhole Principle there exists at least one $\omega_m \neq \omega_j$ such that $\omega_j - \omega_m \equiv 0$ (mod $p_\ell$)
at least $q$ times, where $q > \lfloor \log_M N \rfloor$. Without loss of generality we assume that
$\omega_j - \omega_m \equiv 0$ (mod $p_\ell$) for $\ell = 1, 2, \ldots, q$. Now by the fact that $p_\ell > M$, we have

    $\prod_{\ell=1}^{q} p_\ell > M^q > N.$

By the CRT we would then have $\omega_j \equiv \omega_m$ (mod $N$), a contradiction. □
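The counting argument can be checked numerically with a small sketch. It assumes an integer $M$ so that $\lfloor \log_M N \rfloor$ can be computed exactly; the helper names and the particular moduli are illustrative only.

```python
def floor_log(N, M):
    """Largest integer q with M**q <= N, using exact integer arithmetic."""
    q, power = 0, M
    while power <= N:
        q += 1
        power *= M
    return q

def lemma2_count(k, N, M):
    """Number of pairwise co-prime sample lengths that suffices by Lemma 2."""
    return 1 + (k - 1) * floor_log(N, M)

def shared_moduli(diff, moduli):
    """Number of moduli that divide a nonzero frequency difference."""
    return sum(1 for p in moduli if diff % p == 0)

N, M, k = 2 ** 22, 20, 4
count = lemma2_count(k, N, M)

# Six primes greater than M (assumed example). A difference below N that is
# divisible by as many of them as possible is the product of the smallest ones.
moduli = [23, 29, 31, 37, 41, 43]
worst_diff = 23 * 29 * 31 * 37          # 765049, still smaller than N
collisions = shared_moduli(worst_diff, moduli)
print(count, collisions, floor_log(N, M))
```

A pair of distinct frequencies can therefore collide in at most $\lfloor \log_M N \rfloor$ of the moduli, which is the fact the pigeonhole step relies on.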

We remark that the algorithm in Iwen [2010] requires taking $1 + 2k \log_k N$ co-prime
sample lengths, since that algorithm requires each $\omega_j$ to be isolated in at
least half of the DFTs of length $p_\ell$. This requirement stems from the fact that that
algorithm cannot distinguish between aliased and non-aliased frequencies in a given
sub-sampled DFT. Our worst-case bound is approximately a factor of two better,
though in practice our algorithms never use all those sample lengths on random
input. The fact that we can tell which frequencies are “good” for a given $p_\ell$ allows
us to construct our Fourier representation one term at a time, and quit when we
have achieved a prescribed stopping criterion.

3.  Algorithms 

Both of our algorithms proceed along a similar course; in fact they differ only in
the choice of the sample lengths $p_\ell$. We assume that we are given access to the
continuous-time signal $S(t)$ whose Fourier coefficients we would like to determine,
and further that we can sample from $S$ at arbitrary points $t$ in unit time. This is an
appropriate model for analog signals, but not for discrete ones. In the discrete case,
one could interpolate between given samples to approximate the required $S$-values,
though we have not implemented or analyzed this case. (The same assumptions
hold for the algorithms in Iwen [2010], while those in Gilbert et al. [2002, 2005]
and Hassanieh et al. [2012b] are formulated purely in the discrete realm.) In this
paper, we mainly limit ourselves to the noiseless case. Though this is a highly
unrealistic assumption, it permits a simple description of the underlying algorithm. In
Sec. 3.3, we discuss some of the problems associated with noisy signals and give
a minor modification of our algorithm for low-level noise. A second manuscript in
preparation addresses the issue of noise specifically, with more significant
modifications to the algorithms described below.

3.1. Non-adaptive

Our algorithms start by choosing a sample length $p_1$ such that $p_1 > ck$ for some
constant $c > 1$. For a fixed $\varepsilon < 1/N$, we then compute $\widehat{S}_p$ and $\widehat{S}_{p,\varepsilon}$, sort the results
by magnitude, and compute frequencies $\omega$ via (9) for the $k$ largest coefficients in
absolute value. We then check whether or not each of those frequencies is aliased
via (10), and if it is not, we add it to our list. The coefficient is given by the unshifted
sample value $\widehat{S}_p[h]$ at that frequency. After this, we combine terms with the same
frequency and prune small coefficients from the list. We then iterate until a stopping
criterion is reached. In the empirical study described in Sec. 5, we stopped when
the number of distinct frequencies in our list equalled the desired sparsity.
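The frequency computation can be sketched as follows, assuming the phase convention used in Algorithm 1, in which the shifted coefficient of an isolated frequency $\omega$ equals the unshifted one multiplied by $e^{2\pi i \omega \varepsilon}$. The function name `recover_frequency` and the numerical values are illustrative assumptions.

```python
import cmath

def recover_frequency(unshifted, shifted, eps):
    """omega = Arg(shifted / unshifted) / (2 pi eps), rounded to the nearest integer."""
    return round(cmath.phase(shifted / unshifted) / (2 * cmath.pi * eps))

N = 1024
eps = 1.0 / (2 * N)            # shift of 1/(2N), as used in Sec. 5
omega, a = -317, 0.6 - 0.8j    # assumed isolated frequency and coefficient

unshifted = a
shifted = a * cmath.exp(2j * cmath.pi * omega * eps)
estimate = recover_frequency(unshifted, shifted, eps)
print(estimate)
```

With $\varepsilon = 1/(2N)$ and $\omega \in [-N/2, N/2)$ the phase $2\pi\omega\varepsilon$ stays inside $[-\pi/2, \pi/2)$, so the principal argument determines $\omega$ without wrap-around.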

Our deterministic algorithm chooses $p_\ell$ to be the $\ell$th prime greater than $ck$. This
ensures that all sample lengths are co-prime, at the expense of taking slightly more
samples than necessary. By Lemma 2, $1 + (k - 1)\lfloor \log_{ck} N \rfloor$ such $p_\ell$'s suffice to isolate
every $\omega$ at least once. This gives us worst-case sampling and runtime complexity on
the same order as Iwen [2010], though the results in Sec. 5 indicate that on average
we significantly outperform those pessimistic bounds.

Our Las Vegas algorithm chooses $p_\ell$ uniformly at random from the interval
$[c_1 k, c_2 k]$ for constants $1 < c_1 < c_2$. In this case we cannot make a worst-case
guarantee on the number of iterations needed by the algorithm to converge. However,
the results in Sec. 5 indicate that the Las Vegas version performs similarly to the
deterministic version on the class of signals tested.
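The two ways of choosing sample lengths can be sketched as below. This is a minimal illustration with hypothetical helper names; the trial-division primality test is only adequate for the small lengths involved here.

```python
import random

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def deterministic_lengths(k, c, count):
    """The first `count` primes greater than c*k; distinct primes are pairwise co-prime."""
    lengths, n = [], c * k + 1
    while len(lengths) < count:
        if is_prime(n):
            lengths.append(n)
        n += 1
    return lengths

def las_vegas_length(k, c1, c2, rng):
    """One sample length drawn uniformly from the integers in [c1*k, c2*k]."""
    return rng.randint(c1 * k, c2 * k)

rng = random.Random(0)                       # seeded only to make the sketch repeatable
det = deterministic_lengths(k=4, c=5, count=3)
lv = [las_vegas_length(4, 5, 10, rng) for _ in range(3)]
print(det, lv)
```

The deterministic lengths are co-prime by construction, whereas the random lengths need not be, which is why no worst-case guarantee is available for the Las Vegas variant.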

3.2.  Adaptive 

The algorithms can also be implemented in an adaptive fashion, by which we mean
that the size of the current representation is taken into account in subsequent
iterations. In particular, if $R$ is our current representation, we let $k^* = k - |R|$ and
choose the next $p_\ell$ with respect to $k^*$ instead of $k$. Moreover, before taking DFTs,
we subtract off the contribution from the current representation, so that effort is
not expended re-identifying portions of the spectrum already discovered. This idea
is similar to that in Gilbert et al. [2002, 2005], though in our empirical studies
the evaluation of the representation is done directly, rather than as an
unequally-spaced FFT. This gives our algorithms asymptotically slower runtime, but the effect
is negligible for the values of $k$ studied in Sec. 5. A formal description appears below
in Algorithm 1.


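Algorithm 1. Phaseshift

     Input: function pointer $S$, integers $c_1, c_2, k, N$, real $\varepsilon$
     Output: $R$, a sparse representation for $S$
     $R \leftarrow \emptyset$, $\varepsilon_0 \leftarrow 0$, $\varepsilon_1 \leftarrow \varepsilon$, $\ell \leftarrow 1$
     while $|R| < k$ do
 5:    $k^* \leftarrow k - |R|$    {or $k$ if non-adaptive}
       $p_\ell \leftarrow$ first prime $> c_1 k^*$    {or Uniform$(c_1 k^*, c_2 k^*)$ if Las Vegas}
       for $m = 0$ to $1$ do
         for $j = 0$ to $p_\ell - 1$ do
           $S_{\ell,m}[j] \leftarrow S(j/p_\ell + \varepsilon_m)$
10:        $S_{\mathrm{rep}}[j] \leftarrow \sum_{(\omega, c_\omega) \in R} c_\omega e^{2\pi i \omega (j/p_\ell + \varepsilon_m)}$    {omit if non-adaptive}
         end for
         $\widehat{S}_{\ell,m} \leftarrow \mathrm{FFT}(S_{\ell,m} - S_{\mathrm{rep}})$
         $\widehat{S}^{\mathrm{sort}}_{\ell,m} \leftarrow \mathrm{Sort}(\widehat{S}_{\ell,m})$
         for $j = 1$ to $k^*$ do
15:        $\omega_{j,\ell} \leftarrow \frac{1}{2\pi\varepsilon} \mathrm{Arg}\left( \widehat{S}^{\mathrm{sort}}_{\ell,1}[j] \,/\, \widehat{S}^{\mathrm{sort}}_{\ell,0}[j] \right)$
         end for
       end for
       for $j = 1$ to $k^*$ do
         if $\left| \, |\widehat{S}^{\mathrm{sort}}_{\ell,0}[j]| \,/\, |\widehat{S}^{\mathrm{sort}}_{\ell,1}[j]| - 1 \, \right| < \tau$ then
20:        $R \leftarrow R \cup \{ (\omega_{j,\ell}, \widehat{S}^{\mathrm{sort}}_{\ell,0}[j]) \}$
         end if
       end for
       collect terms in $R$ with same $\omega$
       prune small coefficients from $R$
25:    $\ell \leftarrow \ell + 1$
     end while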
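A compact sketch of the adaptive, deterministic variant for noiseless signals is given below. It is not the C++/FFTW implementation evaluated in Sec. 5: it uses a direct $O(p^2)$ DFT normalised by $1/p$, a tolerance of $p/N$, a rule that skips primes already used, and an iteration cap, all of which are assumptions of this illustration, as are the function names.

```python
import cmath

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def dft(x):
    """Direct O(p^2) DFT, normalised by 1/p (an FFT would be used in practice)."""
    p = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * h * j / p) for j in range(p)) / p
            for h in range(p)]

def phaseshift(S, k, N, c1=5, tol=1e-9, max_iter=50):
    eps = 1.0 / (2 * N)
    R = {}            # current representation: frequency -> coefficient
    used = set()      # sample lengths already tried (assumption of this sketch)
    for _ in range(max_iter):
        if len(R) >= k:
            break
        k_star = k - len(R)
        p = c1 * k_star
        while not is_prime(p) or p in used:
            p += 1
        used.add(p)
        spectra = []
        for shift in (0.0, eps):
            residual = []
            for j in range(p):
                t = j / p + shift
                # Direct evaluation of the current representation at t.
                rep = sum(c * cmath.exp(2j * cmath.pi * w * t) for w, c in R.items())
                residual.append(S(t) - rep)
            spectra.append(dft(residual))
        plain, moved = spectra
        largest = sorted(range(p), key=lambda h: abs(plain[h]), reverse=True)[:k_star]
        for h in largest:
            if abs(plain[h]) <= tol:
                continue                      # empty bin
            if abs(abs(moved[h]) / abs(plain[h]) - 1.0) > p / N:
                continue                      # magnitude changed under the shift: aliased
            w = round(cmath.phase(moved[h] / plain[h]) / (2 * cmath.pi * eps))
            R[w] = R.get(w, 0) + plain[h]     # collect terms with the same frequency
        R = {w: c for w, c in R.items() if abs(c) > tol}   # prune small coefficients
    return R

# Assumed toy signal: 7 and 24 collide modulo the first length (17) and are
# separated by the second length (11).
terms = {7: 1.0, 24: 1j, 90: -1.0}

def S(t):
    return sum(a * cmath.exp(2j * cmath.pi * w * t) for w, a in terms.items())

R = phaseshift(S, k=3, N=256)
print(sorted(R))
```

In the toy run the first pass recovers only the isolated frequency; the colliding pair is rejected by the magnitude test and picked up in the second pass after the known term has been subtracted.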
3.3. Modifications in the presence of noise

In  the  noiseless  versions  of  the  algorithms  described  in  this  paper,  a  test  for  aliasing 
is  implemented  by  considering  the  ratio  of  magnitudes  of  shifted  and  unshifted 
peaks.  When  the  samples  are  corrupted  by  noise,  there  will  be  two  challenges.  The 
first  challenge  is  that  the  reconstruction  of  frequencies  from  shifts  will  be  corrupted 
by  noise.  The  second  challenge  is  that  there  will  be  variations  among  the  magnitudes 
even  for  non-aliased  terms,  so  a  higher  threshold  that  depends  on  the  size  of  the 
noise  must  be  set.  When  this  threshold  is  too  large  it  affects  the  ability  to  distinguish 
aliased  terms  as  there  will  be  an  increased  number  of  false  negatives.  On  the  other 
hand,  lower  thresholds  that  reduce  false  negatives  will  lead  to  an  increased  number 
of  false  positives. 

The first challenge can be addressed effectively through a combination of using
larger $p_\ell$'s, multiple shifts and a multiscale unwrapping. The idea of using larger
$p_\ell$'s is rather straightforward yet effective. For any given $p_\ell$ the DFT detects the
location of the frequencies modulo $p_\ell$ rather accurately even with substantial noise.
Furthermore, the reconstructed frequencies will still tend to cluster around the
true value. Suppose that we sample the signal and compute DFTs of length $p_\ell$ on
these samples. The locations of the peaks in these short DFTs tell us the accurate
value of $\omega \bmod p_\ell$ for each unaliased frequency $\omega$ appearing in the signal. Writing
$\omega = a p_\ell + b$ with $a, b \in \mathbb{Z}$, we now know $b$ and must determine $a$.

With a small amount of noise the reconstructed frequencies $\tilde{\omega}$ using (9) will
be close to the true $\omega$. We can thus round $\tilde{\omega}$ to the nearest integer of the form
$a p_\ell + b$, which will recover the true frequency $\omega$ as long as $|\tilde{\omega} - \omega| < p_\ell/2$. For
high noise levels, it is possible that the $\tilde{\omega}$ will deviate by more than $p_\ell/2$ from $\omega$,
so that the value for $a$ given by rounding will be incorrect. By choosing larger $p_\ell$
(i.e. increasing the parameter $c_1$) one can alleviate the problem somewhat, provided
that the noise level is not too high. When the noise level is so high that taking a
large $p_\ell$ is no longer economical, a potential solution is to take multiple shifts and
employ a multiscale unwrapping technique. We are still at the preliminary stage in
our study of these new techniques, but early results are very encouraging.
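The rounding step can be sketched in a few lines. The function name `snap_frequency` and the numerical values are illustrative assumptions; the second call shows the failure mode once the error in the estimate exceeds half the sample length.

```python
def snap_frequency(estimate, b, p):
    """Nearest integer of the form a*p + b to a noisy frequency estimate."""
    a = round((estimate - b) / p)
    return a * p + b

p = 23
omega = 1234
b = omega % p                                # peak location in the length-p DFT (15 here)

good = snap_frequency(1240.7, b, p)          # error 6.7, below p/2 = 11.5
bad = snap_frequency(1247.0, b, p)           # error 13, above p/2
negative = snap_frequency(-96.2, (-100) % 17, 17)
print(good, bad, negative)
```

Python's `%` returns a non-negative residue for negative frequencies, which matches the bin index of the DFT peak.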

The second challenge poses a bigger problem, but again it can be addressed in
several ways. The multiscale unwrapping method will repeatedly check for aliasing
at each stage, which makes it highly unlikely that an aliased frequency will
pass through all the tests. Even in the unlikely event that it does, our algorithm
allows false positives. Since each mode is subtracted from the original signal in our
algorithm, a false positive frequency will lead to an extra mode in the new signal.
As the process continues it will be extracted and cancel out the false frequency
extracted earlier.

4.  Average-Case  Analysis 

In this section, we prove that the average-case runtime and sampling complexity
of our algorithm are $\Theta(k \log k)$ and $\Theta(k)$, respectively. This is shown over a class
of random signals described in Sec. 4.2. Before giving this result on the expected
runtime and sampling complexity, in Sec. 4.1 we estimate the costs of a single iteration
of the while loop in Algorithm 1, lines 5-25. We then describe, in Sec. 4.2, the
random signal model over which we prove our average-case bounds. In Sec. 4.3, we
prove that the expected number of iterations of the while loop is constant, and in
Sec. 4.4, we use this result to prove our average-case bounds.


4.1.  While  loop  runtime  and  sampling  complexity 

The computational cost of the while loop in Algorithm 1, lines 5-25, is dominated
by three operations. The first is the evaluation of the current representation $R$
of $k - k^*$ terms at the $O(k^*)$ points $j/p_\ell$ in line 10. In our implementation, we
simply calculated this directly, looping over both the sample points and the terms
in the representation. The complexity of this implementation is $O(p_\ell (k - k^*)) =
O(k^* (k - k^*)) = O(k^2)$, and while non-equispaced FFTs [Dutt and Rokhlin (1993);
Anderson and Dahleh (1996)] yield an asymptotically faster runtime of $O(k \log(k))$,
they also incur large overhead costs. For the values of $k$ considered in this paper, the
direct evaluation seems to have little effect on the overall runtime. The other two
dominant computational tasks in the inner loop are the FFTs of $O(k)$ samples and
the subsequent sorting of these DFT coefficients. It is well-known that both of these
operations can be done in time $\Theta(k \log(k))$ [Cormen et al. (2001)]. Thus the inner
loop has overall time complexity $\Theta(k \log(k))$, assuming the use of non-equispaced
FFTs.


4.2.  Random  signal  model 

For both the average-case analysis and for the empirical evaluation described in
Sec. 5, we considered test signals with uniformly random phase over the bandwidth
and coefficients chosen uniformly from the complex unit circle. In other words, given
$k$ and $N$, we choose $k$ frequencies $\omega_j$ uniformly at random (without replacement)
from $[-N/2, N/2) \cap \mathbb{Z}$. The corresponding Fourier coefficients $a_j$ are of the form
$e^{2\pi i \theta_j}$, where $\theta_j$ is drawn uniformly from $[0, 1)$. The signal is then given by

    $S(t) = \sum_{j=1}^{k} a_j e^{2\pi i \omega_j t}.$    (15)

This  is  the  standard  signal  model  considered  in  previous  empirical  evaluations  of 
sub-linear  Fourier  algorithms  [Iwen  et  al.  (2007);  Iwen  (2010);  Hassanieh  et  al. 
(2012b)]. 
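A generator for signals of the form (15) can be sketched as follows. The function name `random_sparse_signal` and the returned pair (a frequency-to-coefficient dictionary plus a callable) are choices made for this illustration only.

```python
import cmath
import random

def random_sparse_signal(k, N, rng):
    """k distinct integer frequencies in [-N/2, N/2) with unit-modulus coefficients."""
    freqs = rng.sample(range(-N // 2, N // 2), k)          # without replacement
    coeffs = [cmath.exp(2j * cmath.pi * rng.random()) for _ in freqs]
    terms = dict(zip(freqs, coeffs))

    def S(t):
        return sum(a * cmath.exp(2j * cmath.pi * w * t) for w, a in terms.items())

    return terms, S

rng = random.Random(1)             # seeded only to make the sketch repeatable
terms, S = random_sparse_signal(k=8, N=2 ** 10, rng=rng)
print(sorted(terms))
```

The callable `S` can be sampled at arbitrary real points, which matches the continuous-time access model assumed in Sec. 3.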

4.3.  Markov  analysis  of  collisions 

In order to analyze the expected runtime and sampling complexity of our algorithms,
we must estimate the expected number of collisions among frequencies
modulo the sample lengths used by the algorithms. Recall that in the noiseless
case, our algorithms are able to detect when a collision between two or more
frequencies has occurred, and for those that are not aliased we are able to calculate
the value of the frequency. Thus we seek to estimate the expected fraction of
frequencies that are aliased modulo a given sample length $p$, since this determines
how many passes the algorithm makes. In this section, we derive bounds on the
expected value of this quantity and discuss how the stopping criteria used in the
algorithm affect its average-case performance.

In the random signal model considered in Sec. 5, we assume the $k$ frequencies are
uniformly distributed over the bandwidth $[-N/2, N/2)$, and so the residues $\omega \bmod p$
are also uniformly distributed over $[0, p - 1]$. Our problem then becomes a classical
occupancy problem: The number of collisions among the frequencies is equivalent
to the number of multiple-occupancy bins when $k$ balls are thrown uniformly at
random into $p$ bins. Define $X_m$ to be the number of single-occupancy bins after $m$
balls are thrown, $Y_m$ to be the number of multiple-occupancy bins after $m$ balls are
thrown, and $Z_m$ to be the number of zero-occupancy bins after $m$ balls are thrown.
Since $p$ is constant, we have the trivial relationship $Z_m = p - X_m - Y_m$, so it suffices
to consider only the pair $(X_m, Y_m)$. When the $(m+1)$st ball is thrown, we have the
following possibilities:

• it lands in an unoccupied bucket, with probability $Z_m/p = 1 - (X_m + Y_m)/p$;

• it lands in a single-occupancy bucket, with probability $X_m/p$;

• it lands in a multiple-occupancy bucket, with probability $Y_m/p$.

In the first case, we have $X_{m+1} = X_m + 1$, $Y_{m+1} = Y_m$; in the second case, we
have $X_{m+1} = X_m - 1$, $Y_{m+1} = Y_m + 1$; and in the third case, we have $X_{m+1} =
X_m$, $Y_{m+1} = Y_m$. Conditioning on the values of $X_m, Y_m$ we have


    $E\left( \begin{pmatrix} X_{m+1} \\ Y_{m+1} \end{pmatrix} \,\middle|\, \begin{pmatrix} X_m \\ Y_m \end{pmatrix} \right) = \begin{pmatrix} 1 - 2/p & -1/p \\ 1/p & 1 \end{pmatrix} \begin{pmatrix} X_m \\ Y_m \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \end{pmatrix}$    (16)

so that the system forms a Markov chain. By recursively conditioning on the values
of $X_{m-1}, Y_{m-1}$, we can calculate the expected values of $X_k, Y_k$ for any $k > 0$ using
the initial condition $X_1 = 1$, $Y_1 = 0$. Denoting by $A$ the matrix in the right-hand
side of Eq. (16), we have

    $E \begin{pmatrix} X_k \\ Y_k \end{pmatrix} = \sum_{m=0}^{k-1} \left( A^m \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) = \left( \sum_{m=0}^{k-1} A^m \right) \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$    (17)

Since $\rho(A) = 1 - 1/p < 1$, where $\rho$ is the spectral radius, the geometric matrix
series can be written

    $\sum_{m=0}^{k-1} A^m = (I - A)^{-1} (I - A^k).$    (18)
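After some linear algebra, we obtain

    $E \begin{pmatrix} X_k \\ Y_k \end{pmatrix} = \begin{pmatrix} k (1 - 1/p)^{k-1} \\ p \left[ 1 - (1 - 1/p)^k \right] - k (1 - 1/p)^{k-1} \end{pmatrix}.$    (19)

Since $Z_k = p - X_k - Y_k$, we have $E(Z_k) = p(1 - 1/p)^k$.

In our algorithms, we choose $p = ck$ for some small integer $c$. Using this and
the approximation $(1 + x/n)^n \approx e^x$, we have

    $E \begin{pmatrix} X_k \\ Y_k \end{pmatrix} \approx \begin{pmatrix} k e^{-1/c} \\ ck (1 - e^{-1/c}) - k e^{-1/c} \end{pmatrix}.$    (20)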


This  gives  a  non-linear  equation  for  the  expected  number  of  collisions  among  k 
frequencies  as  a  function  of  the  parameter  c.  Newton’s  method  can  then  be  used  to 
determine  the  value  c  required  to  ensure  a  desired  fraction  of  the  frequencies  are 
not  aliased.  For  example,  to  ensure  that  90%  of  frequencies  are  isolated  on  average, 
it  suffices  to  take  c  =  5;  this  value  for  the  parameter  c  had  already  been  found  to 
give  good  performance  in  our  empirical  evaluation  of  the  algorithms. 
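The occupancy expectations can be checked numerically by iterating the recurrence (16) in expectation and comparing with the standard closed forms for single-occupancy and empty bins. The function names are illustrative, and starting the iteration from zero balls is an assumption equivalent to the initial condition $X_1 = 1$, $Y_1 = 0$.

```python
import math

def expected_occupancy(k, p):
    """Iterate E[(X, Y)] <- A E[(X, Y)] + (1, 0), starting from no balls thrown."""
    x, y = 0.0, 0.0
    for _ in range(k):
        x, y = (1 - 2 / p) * x - y / p + 1, x / p + y
    return x, y

def closed_form(k, p):
    """E[X_k] = k(1 - 1/p)^(k-1), E[Z_k] = p(1 - 1/p)^k, E[Y_k] = p - E[X_k] - E[Z_k]."""
    x = k * (1 - 1 / p) ** (k - 1)
    z = p * (1 - 1 / p) ** k
    return x, p - x - z

k, c = 100, 5
x_it, y_it = expected_occupancy(k, c * k)
x_cf, y_cf = closed_form(k, c * k)

# Large-k limits per frequency, using (1 + x/n)^n ~ e^x.
single_limit = math.exp(-1 / c)
multiple_limit = c * (1 - math.exp(-1 / c)) - math.exp(-1 / c)
print(x_it / k, y_it / k, single_limit, multiple_limit)
```

For $c = 5$ the expected number of multiple-occupancy bins is roughly $0.09k$, while roughly $0.82k$ of the bins hold exactly one frequency.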


4.4.  Average-case  runtime  and  sampling  complexity 

In this section, we will use a probabilistic recurrence relation due to Karp [Karp
(1994); Dubhashi and Panconesi (2009)] to give average-case performance bounds
and concentration results for the case when the algorithm is halted after identifying
$k$ or more terms. In particular, we use the following theorem for recurrences of the
form

    $T(k) = a(k) + T(H(k)),$    (21)

where $T(k)$ denotes the time required to solve an instance of size $k$, $a(k)$ is the
amount of work done on a problem of size $k$, and $0 \le H(k) \le k$ is a random
variable denoting the size of the subproblem generated by the algorithm.

Theorem 2. [Karp (1994, Theorem 1.2)] Suppose $a(k)$ is non-decreasing,
continuous, and strictly increasing on $\{x : a(x) > 0\}$, and that $E[H(k)] \le m(k)$ for
a non-decreasing continuous function $m(k)$ such that $m(k)/k$ is also non-decreasing.
Denote by $u(k)$ the solution to the deterministic recurrence

    $u(k) = a(k) + u(m(k)).$    (22)

Then for $k > 0$ and $t \in \mathbb{N}$,

    $P[T(k) > u(k) + t\,a(k)] \le \left( \frac{m(k)}{k} \right)^t.$    (23)
Our algorithm does work $a(k) = \Theta(k \log(k))$ on input of size $k$ and generates a
subproblem whose average size is $m(k) = k/10$. (Recall from Sec. 4.3 that with the
parameter $c = 5$, on average over 90% of the frequencies were not aliased modulo
$p = O(ck)$.) The associated deterministic recurrence is then

    $u(k) = \Theta(k \log(k)) + u(k/10),$    (24)

whose solution is $u(k) = \Theta(k \log(k))$ [see, e.g. Cormen et al. (2001)]. A
straightforward application of Theorem 2 yields the following

Theorem 3 (Runtime bound). Let $T(k)$ denote the runtime of Algorithm 1 on
a random signal drawn from the class in Sec. 4.2. Then $E[T(k)] = \Theta(k \log(k))$ and

    $P[T(k) > \Theta(k \log(k)) + t k \log(k)] \le 10^{-t}.$    (25)

The sampling complexity $S(k)$ can be handled in an analogous manner, since
in this case $a(k) = \Theta(k)$ and $m(k) = k/10$ as before. The associated deterministic
recurrence becomes

    $u(k) = \Theta(k) + u(k/10),$    (26)

whose solution is $u(k) = \Theta(k)$. Applying Theorem 2 again we have the following

Theorem 4 (Sampling bound). Let $S(k)$ denote the number of samples used by
Algorithm 1 on a random signal drawn from the class in Sec. 4.2. Then $E[S(k)] =
\Theta(k)$ and

    $P[S(k) > \Theta(k) + tk] \le 10^{-t}.$    (27)
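The two deterministic recurrences can be unrolled numerically to see why their solutions stay within a constant factor of the top-level work. The sketch assumes unit constants, that is $a(k) = k \log_2 k$ and $a(k) = k$, and stops once the subproblem size drops below one; these conventions and the function names are illustrative.

```python
import math

def solve_recurrence(a, k, shrink=10.0):
    """u(k) = a(k) + u(k / shrink), unrolled until the subproblem size drops below 1."""
    total = 0.0
    while k >= 1:
        total += a(k)
        k /= shrink
    return total

def runtime_work(k):
    return k * math.log2(k)

def sample_work(k):
    return k

def tail_bound(t, ratio=0.1):
    """Right-hand side of Karp's bound, (m(k)/k)^t, for m(k)/k = ratio."""
    return ratio ** t

k = 4096
u_time = solve_recurrence(runtime_work, k)
u_samples = solve_recurrence(sample_work, k)
print(u_time / runtime_work(k), u_samples / sample_work(k), tail_bound(3))
```

Because the subproblem shrinks geometrically, both ratios are bounded by the geometric series $1 + 1/10 + 1/100 + \cdots = 10/9$.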


5.  Empirical  Evaluation 

In this section, we describe the results of an empirical evaluation of the adaptive
deterministic and Las Vegas variants of the Phaseshift algorithm described above.
Both algorithms were implemented in C++ using FFTW 3.0 [Frigo and Johnson
(2005)] for the FFTs, using FFTW_ESTIMATE plans since the sample lengths are
not known in advance for the Las Vegas variant. For comparison, we also ran the
same tests on the four variants of GFFT as well as on AAFFT and FFTW itself.
The FFTW runs utilized the FFTW_PATIENT plans with wisdom enabled, and so
are highly optimized. The experiments were run on a single core of an Intel Xeon
E5620 CPU with a clock speed of 2.4 GHz and 24 GB of RAM, running SUSE
Linux with kernel 2.6.16.60-0.81.2-smp for x86_64. All code was compiled with the
Intel compiler using the -fast optimization. As in Iwen [2012], timing is reported
in CPU ticks using the cycle.h file included with the source code for FFTW.

In the following sections, we refer to our algorithm as “Phaseshift”, since by
taking shifted time samples of the input signal we also shift the phase of the
Fourier coefficients. To keep the plots readable, we only show data for the adaptive,
deterministic variant of our algorithm; the other variants perform similarly. The
algorithms of Iwen [2012] are denoted GFFT-XY, where X $\in$ {D,R} and Y $\in$ {F,S}.
The D/R stands for deterministic or randomized, while the F/S stands for fast or
slow. The fast variants use more samples but less runtime while the slow variants
use fewer samples but more runtime. In the plots below, we always show the GFFT
variant with the most favorable sampling or runtime complexity. Finally, AAFFT
denotes the algorithm of Gilbert et al. [2005]. The implementations tested are
summarized in Table 1 along with the average-case sampling and runtime complexities,
and the associated references.


5.1.  Setup 

Each data point in Figs. 1-2 is the average of 100 independent trials of the associated
algorithm for the given values of the bandwidth $N$ and the sparsity $k$. The lower and
upper bars associated with each data point represent the minimum and maximum
number of samples or runtime of the algorithm over the 100 test functions. The
values of $k$ tested were $2, 4, 8, \ldots, 4096$, while the values of $N$ were $2^{17}, 2^{18}, \ldots, 2^{26}$.
For larger values of $k$, the slow GFFT variants and AAFFT took too long to complete
on our hardware, so we only present partial data for these algorithms.
Nevertheless, the trend seen in the plots below continues for higher values of the sparsity.
The test signals were generated according to the signal model described in Sec. 4.2.

Table 1. Implementations used in the empirical evaluation.

Algorithm   R/D   Samples          Runtime          Reference
PS-Det      D     $k$              $k \log k$       Section 4
PS-LV       R     $k$              $k \log k$       Section 4
GFFT-DF     D     $k^2 \log^4 N$   $k^2 \log^4 N$   [Iwen (2012)]
GFFT-DS     D     $k^2 \log^2 N$   $Nk \log^2 N$    [Iwen (2012)]
GFFT-RF     R     $k \log^4 N$     $k \log^5 N$     [Iwen (2012)]
GFFT-RS     R     $k \log^2 N$     $N \log N$       [Iwen (2012)]
AAFFT       R     $k \log^c N$     $k \log^c N$     [Gilbert et al. (2005)]
FFTW        D     $N$              $N \log N$       [Frigo and Johnson (2005)]

[Figure 1 panels: (a) Samples for Fixed Bandwidth $N = 2^{22}$; (b) Runtime for Fixed Bandwidth $N = 2^{22}$]

Fig. 1. (Color online) (a) Sampling complexity with fixed bandwidth $N = 2^{22}$ for PS-Det (blue
solid line), GFFT-RS (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed
line). (b) Runtime complexity with fixed bandwidth $N = 2^{22}$ for PS-Det (blue solid line),
GFFT-RF (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed line).

[Figure 2 panels: (a) Samples for Fixed Sparsity $k = 60$; (b) Runtime for Fixed Sparsity $k = 60$]

Fig. 2. (Color online) (a) Sampling complexity with fixed sparsity $k = 60$ for PS-Det (blue solid
line), GFFT-RS (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed line).
(b) Runtime complexity with fixed sparsity $k = 60$ for PS-Det (blue solid line), GFFT-RF (red
solid line), AAFFT (black dashed line), and FFTW (magenta dashed line).

The Phaseshift and deterministic GFFT variants will always recover such signals
exactly. The randomized GFFT variants are Monte Carlo algorithms, and so, when
they succeed, will also recover the signal exactly. AAFFT, on the other hand, is
an approximation algorithm which will fail on a non-negligible set of input signals.
However, for the runs depicted in Figs. 1-2, AAFFT always produced an answer
with $\ell^2$ error less than $10^{-4}$. The randomized GFFT variants failed a total of 7 times
out of 2,200 test signals, a relatively small number that can be reduced by parameter
tuning. For the Phaseshift variants, we chose the parameters $c_1 = 5$, $c_2 = 10$, and
took the shift $\varepsilon$ to be $1/2N$. Finally, for the randomized GFFT variants, we chose
the Monte Carlo parameter to be 1.2.

5.2.  Sampling  complexity 

In Fig. 1(a), we compare the average number of samples of the input signal $S$
required by each algorithm when the bandwidth $N$ is fixed at $2^{22}$. The sparsity of
the test signal is varied from 2 to 4096 by powers of two. We can see that the
Phaseshift variants require over an order of magnitude fewer samples than
GFFT-RS, the GFFT variant with the lowest sampling requirements. Both Phaseshift
variants also require over an order of magnitude fewer samples than AAFFT. The
comparison with the deterministic GFFT variants is even starker; Phaseshift-Det
requires two orders of magnitude fewer samples than GFFT-DS (not shown), and
four orders of magnitude fewer samples than GFFT-DF (not shown).

In Fig. 2(a), we compare the average number of samples of the input signal $S$
required by each algorithm when the sparsity $k$ is fixed at 60. The bandwidth $N$
was varied from $2^{17}$ to $2^{26}$ by powers of two. Using powers of two for the bandwidth
allows the best performance for both FFTW and AAFFT, though this fact is more
relevant for the runtime comparisons in the following section. We can see that the
Phaseshift variants require many fewer samples than all four GFFT variants as well
as AAFFT and FFTW, for all values of $N$ tested. The Phaseshift variants exhibit
almost no dependence on the bandwidth for all values of $N$, a feature not shared
by the other deterministic algorithms. We note here that in future work we plan to
replace the $1/2N$ shift by two or more larger shifts with co-prime denominators to
obtain an equivalent shift, as in Wang and Zhou [1998]. This should lead to more
robustness at high values of $N$.

5.3.  Runtime  complexity 

In Fig. 1(b), we compare the average runtime of each algorithm over 100 test signals
when the bandwidth $N$ is fixed at $2^{22}$. The range of sparsity $k$ considered is the
same as in Sec. 5.2. For all values of $k$ the Phaseshift variants are faster than
GFFT-RF (the fastest GFFT variant) and AAFFT by more than an order of magnitude.
When compared to GFFT-RS (not shown), GFFT-DS (not shown), and FFTW,
the difference in runtime is closer to three orders of magnitude.

In Fig. 2(b), we compare the average runtime of each algorithm over 100 test
signals when the sparsity $k$ is fixed at 60. The range of bandwidth considered is the
same as in Sec. 5.2. The Phaseshift variants are the only algorithms that outperform
FFTW for all values of $N$ tested. The other implementations tested only become
competitive with the standard FFT for $N > 2^{20}$, while ours are faster even for
modest $N$.

5.4.  Noisy  case 

We report here on a preliminary study of the performance of the deterministic
algorithm in the presence of noise. Our noisy signals were of the same form as in
the previous section, but with complex white Gaussian noise of standard deviation
$\sigma$ added to each measurement. As described in Sec. 3.3, the simplest way to deal
with low-level noise is to simply round the reconstructed frequencies to the nearest
integer of the form $a p_\ell + b$, where $b = \omega \bmod p_\ell$ is the location of the peak in a
length-$p_\ell$ DFT. This modification does not change the runtime or sampling complexity
significantly, so in this section we focus on the error in the approximation as a
function of the noise level $\sigma$ and the parameter $c_1$.

In the existing literature on the sparse Fourier transform, the $\ell^2$ norm is most
often used to assess the quality of approximation. There are many reasons for this
choice, with the two most convincing perhaps being the completeness of the complex
exponentials with respect to the $\ell^2$ norm and Parseval’s theorem. For certain
applications, however, this choice of norm is inappropriate. For example, in wideband
spectral estimation and radar applications, one is interested in identifying a
set of frequency intervals containing active Fourier modes. In this case, an estimate
$\tilde{\omega}$ of the true frequency $\omega$ with $|\tilde{\omega} - \omega| \ll N$ is useful, but unless $\tilde{\omega} = \omega$ the $\ell^2$ metric
will report an $O(1)$ error. Furthermore, when considering non-periodic signals
(equivalently, non-integer $\omega$’s) the same precision problem appears when using the
$\ell^2$ metric.

For these reasons, we propose measuring the approximation error of sparse
Fourier transform problems with the Earth Mover Distance (EMD) [Rubner et al.
(2000)]. Originally developed in the context of content-based image retrieval, EMD
measures the minimum cost that must be paid (with a user-specified cost function)
to transform one distribution of points into another. EMD can be calculated
efficiently as the solution of a linear program corresponding to a certain flow
minimization problem. In our situation, we consider the cost to move a set of estimated
Fourier modes and coefficients $\{(\tilde{\omega}_j, c_{\tilde{\omega}_j})\}_{j=1}^{k}$ to the true values
$\{(\omega_j, c_{\omega_j})\}_{j=1}^{k}$ under the cost function

$$d_1\big((\omega, c_\omega), (\tilde{\omega}, c_{\tilde{\omega}})\big) := \frac{|\omega - \tilde{\omega}|}{N} + |c_\omega - c_{\tilde{\omega}}|.$$

Fig. 3. EMD(1) error as a function of the noise level $\sigma$ for various choices of the parameter $c_1$.
The sparsity and bandwidth are fixed at $k = 64$, $N = 2^{22}$, respectively.

This choice of cost function strikes a balance between the fidelity of the frequency
estimate (as a fraction of the bandwidth) and that of the coefficient estimate. We
denote the EMD using $d_1$ for the cost function by EMD(1) below.
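
To make the contrast between the two metrics concrete, the following sketch evaluates both on a toy example. It is not the linear-programming formulation of Rubner et al.; it assumes equally many estimated and true modes with unit weights, in which case the optimal flow is a one-to-one matching that can be found by brute force for small $k$, and it reports the total (unnormalized) cost. The function names and test values are hypothetical.

```python
from itertools import permutations
from math import sqrt


def d1(est, true, N):
    """Cost d_1: frequency error as a fraction of N plus coefficient error."""
    (w_est, c_est), (w_true, c_true) = est, true
    return abs(w_est - w_true) / N + abs(c_est - c_true)


def emd1_bruteforce(estimates, truths, N):
    """Minimum total d_1 cost over all one-to-one matchings (small k only)."""
    k = len(truths)
    if len(estimates) != k:
        raise ValueError("this sketch assumes equally many estimated and true modes")
    return min(
        sum(d1(estimates[i], truths[perm[i]], N) for i in range(k))
        for perm in permutations(range(k))
    )


def l2_error(estimates, truths):
    """l2 distance between the two sparse coefficient vectors."""
    diff = {}
    for w, c in truths:
        diff[w] = diff.get(w, 0) + c
    for w, c in estimates:
        diff[w] = diff.get(w, 0) - c
    return sqrt(sum(abs(c) ** 2 for c in diff.values()))


N = 2 ** 22
truths = [(1000, 1 + 0j), (50000, 2 + 0j)]
estimates = [(50001, 2 + 0j), (1000, 1 + 0j)]  # one frequency is off by one bin
print(emd1_bruteforce(estimates, truths, N))   # 1 / N, i.e. tiny
print(l2_error(estimates, truths))             # 2 * sqrt(2), i.e. O(1)
```

A single off-by-one frequency costs only $1/N$ under $d_1$, while the $\ell^2$ metric charges the full magnitude of the coefficient twice.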

In Fig. 3, we report the average EMD(1) error over 100 test signals as a function
of the input noise level $\sigma$, for various choices of the parameter $c_1$. In this experiment,
the sparsity and bandwidth are fixed at $k = 64$ and $N = 2^{22}$, respectively. As
expected, the error decreases as $c_1$ increases, since the rounding procedure described
in Sec. 3.3 is more likely to result in the true frequency. Moreover, the error increases
linearly with the noise level, indicating the procedure's robustness in the presence
of noise.

We remark that in the noiseless case the choice $c_1 = 5$ was found to be sufficient,
while Fig. 3 indicates that the much larger value $c_1 \approx 256$ is necessary
for good approximation in the EMD(1) metric. The larger sample lengths imply an
increase in both the runtime and sampling complexity, and indicate that the rounding
procedure of Sec. 3.3 should be complemented by other modifications. This is
the purpose of a second manuscript under preparation, in which we combine the
rounding procedure with the use of larger shifts $\varepsilon_j$ in a multiscale approach to
frequency estimation.

6.  Conclusion 

In this paper, we have presented deterministic and Las Vegas algorithms for the
sparse Fourier transform problem that empirically outperform existing algorithms
in average-case sampling and runtime complexity. While our worst-case bounds
do not improve the asymptotic complexity, we are able to extend by an order of
magnitude the range of sparsity for which our algorithm is faster than FFTW in
the average case. The improved performance of our algorithm can be attributed
to two major factors: adaptivity and the ability to detect aliasing. In particular, we
are able to extract more information from a small number of function samples by
considering the phase of the DFT coefficients in addition to their magnitudes. This
represents a significant improvement over the current state of the art for the sparse
Fourier transform problem.

We have developed a multiresolution approach to handle the noisy case, in which
we learn the value of a frequency from most to least significant bit by increasing
the size of the shift $\varepsilon$. Finally, we are exploring the extension of these methods to
handle non-integer frequencies, which would represent the first such result in the
sparse Fourier transform context.
1  Introduction

Modern high performance computers offer hundreds of thousands of processors that
can be leveraged, in parallel, to compute numerical solutions to time-dependent partial
differential equations (PDEs). For grid-based solutions to these PDEs, domain
decomposition (DD) is often employed to add spatial parallelism [19].

Parallelism in the time variable is more difficult to exploit due to the inherent
causality. Recently, researchers have explored this issue as a means to improve the
scalability of existing parallel spatial solvers applied to time-dependent problems.
There are several general approaches to combine temporal parallelism with spatial
parallelism. Waveform relaxation [15] is an example of a “parallel across the problem”
method. The “parallel across the time domain” approaches include the parareal
method [11, 17, 16]. The parareal method decomposes a time domain into smaller
temporal subdomains and alternates between applying a coarse (relatively fast)
sequential solver to compute an approximate (not very accurate) solution, and applying
a fine (expensive) solver on each temporal subdomain in parallel. Alternatively,
one can consider “parallel across the step” methods. Examples of such approaches
include the computation of intermediate Runge-Kutta stage values in parallel [18],
and Revisionist Integral Deferred Correction (RIDC) methods, which are the family
of parallel time integrators considered in this paper. Parallel across the step methods
allow for “small scale” parallelism in time. Specifically, we will show that if a
DD implementation scales to $N_x$ processors, a RIDC-DD parallelism will scale to
$N_t \times N_x$ processors, where $N_t < 12$ in practice. This contrasts with parallel across
the time domain approaches, which can potentially utilize $N_t \gg 12$.
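
For readers unfamiliar with the parareal alternation of coarse and fine solvers described above, the following serial sketch applies it to the scalar test equation $y' = \lambda y$. Everything here is an illustrative assumption rather than the method of [11, 17, 16] as implemented anywhere: backward Euler is used for both propagators, the coarse solver takes one step per temporal subdomain, and the fine solves (the list `F`) are the part that would run in parallel.

```python
def backward_euler(y, lam, dt, steps):
    """Advance y' = lam * y by `steps` backward Euler steps of size dt."""
    for _ in range(steps):
        y = y / (1.0 - lam * dt)
    return y


def parareal(y0, lam, T, n_slices, n_fine, n_iters):
    """Return the parareal iterate at the slice boundaries t = 0, dT, ..., T."""
    dT = T / n_slices

    def coarse(y):  # cheap, inaccurate: one step per temporal subdomain
        return backward_euler(y, lam, dT, 1)

    def fine(y):  # expensive, accurate: n_fine steps per temporal subdomain
        return backward_euler(y, lam, dT / n_fine, n_fine)

    # Initial guess from a sequential coarse sweep.
    U = [y0]
    for n in range(n_slices):
        U.append(coarse(U[n]))

    for _ in range(n_iters):
        F = [fine(U[n]) for n in range(n_slices)]        # independent: parallel in time
        G_old = [coarse(U[n]) for n in range(n_slices)]
        U_new = [y0]
        for n in range(n_slices):                        # sequential coarse correction
            U_new.append(coarse(U_new[n]) + F[n] - G_old[n])
        U = U_new
    return U


reference = backward_euler(1.0, -1.0, 1.0 / 200, 200)    # serial fine solve
print(abs(parareal(1.0, -1.0, 1.0, 4, 50, 1)[-1] - reference))
```

After as many iterations as there are temporal subdomains, the iterate reproduces the serial fine solution up to rounding, which is why parareal is only useful when it converges in far fewer iterations.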

This paper discusses the implementation details and profiling results of the parallel
space-time RIDC-DD algorithm described in [5]. Two hybrid OpenMP-MPI
frameworks are discussed: (i) a more traditional fork-join approach of combining
threads before doing MPI communications, and (ii) a threaded MPI communications
framework. The latter framework is highly desirable because existing (spatially
parallel) legacy software can be easily integrated with the parallel time integrator.
Numerical experiments measure the communication overhead of both frameworks, and
demonstrate that the fork-join approach scales well in space and time. Our results
indicate that one should strongly consider temporal parallelization for the solution
of time-dependent PDEs.
USA e-mail: ongbw@msu.edu
2  Review

We are interested in parallel space-time solutions to the linear heat equation.
We describe the application of our method to the linear heat equation in one spatial
dimension, $x \in [0,1]$ and $t \in [0,T]$,

$$u_t = u_{xx}, \quad u(t,0) = g_0(t), \quad u(t,1) = g_1(t), \quad u(0,x) = u_0(x). \qquad (1)$$

The actual numerical results in §4 are presented for the 2D heat equation.


2.1  RIDC 


RIDC methods [6, 7] are a family of parallel time integrators that can be broadly
classified as predictor-corrector algorithms [10, 2]. The basic idea is to simultaneously
compute solutions to the PDE of interest and associated error PDEs using a
low-order time integrator. We first review the derivation of the error equation.

Suppose $v(t,x)$ is an approximate solution to (1), and $u(t,x)$ is the (unknown)
exact solution. The error in the approximate solution is $e(t,x) = u(t,x) - v(t,x)$. We
define the residual as $\epsilon(t,x) = v_t(t,x) - v_{xx}(t,x)$. Then the time derivative of the
error satisfies $e_t = u_t - v_t = u_{xx} - (v_{xx} + \epsilon)$. The integral form of the error equation,

$$\left[ e + \int_0^t \epsilon(\tau,x)\,d\tau \right]_t = (v + e)_{xx} - v_{xx}, \qquad (2)$$

can then be solved for $e(t,x)$ using the initial condition $e(0,x) = 0$. The correction
$e(t,x)$ is combined with the approximate solution $v(t,x)$ to form an improved solution.
This improved solution can be fed back into the error equation (2) and the
process repeated until a sufficiently accurate solution is obtained. It has been shown
that each application of the error equation improves the order of the overall method,
provided the integral is approximated with sufficient accuracy using quadrature [8].

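We introduce some notation to identify the sequence of corrected approximations.
Denote $v^{[p]}(t,x)$ as the approximate solution which has error $e^{[p]}(t,x)$, which
is obtained by solving

$$\left[ e^{[p]} + \int_0^t \epsilon^{[p]}(\tau,x)\,d\tau \right]_t = \left( v^{[p]} + e^{[p]} \right)_{xx} - v^{[p]}_{xx}. \qquad (3)$$

Here $\epsilon^{[p]} = v^{[p]}_t - v^{[p]}_{xx}$ is the residual of $v^{[p]}$, and $v^{[0]}(t,x)$ denotes the initial
approximate solution obtained by solving the physical PDE (1) using a low-order
integrator. In general, the error from the $p$th correction equation is used to construct
the $(p+1)$st approximation, $v^{[p+1]}(t,x) = v^{[p]}(t,x) + e^{[p]}(t,x)$. Hence, equation (3)
can be expressed as

$$\left[ v^{[p+1]} - \int_0^t v^{[p]}_{xx}(\tau,x)\,d\tau \right]_t = v^{[p+1]}_{xx} - v^{[p]}_{xx}. \qquad (4)$$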
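
The step from (3) to (4) can be spelled out as follows. This is a sketch that assumes the residual of the $p$th approximation is $\epsilon^{[p]} = v^{[p]}_t - v^{[p]}_{xx}$, by analogy with the derivation of (2), and that $v^{[p+1]} = v^{[p]} + e^{[p]}$.

```latex
\begin{align*}
\int_0^t \epsilon^{[p]}(\tau,x)\,d\tau
  &= v^{[p]}(t,x) - v^{[p]}(0,x) - \int_0^t v^{[p]}_{xx}(\tau,x)\,d\tau, \\
e^{[p]}(t,x) + \int_0^t \epsilon^{[p]}(\tau,x)\,d\tau
  &= v^{[p+1]}(t,x) - v^{[p]}(0,x) - \int_0^t v^{[p]}_{xx}(\tau,x)\,d\tau .
\end{align*}
```

Differentiating the second line in $t$ removes the time-independent term $v^{[p]}(0,x)$ and gives the left-hand side of (4); substituting $v^{[p]} + e^{[p]} = v^{[p+1]}$ into the right-hand side of (3) gives $v^{[p+1]}_{xx} - v^{[p]}_{xx}$.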
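We compute a low-order prediction, $v^{[0],n+1}$, for the solution of (1) at time $t^{n+1}$
using a first-order backward Euler discretization (in time):

$$v^{[0],n+1} - \Delta t\, v^{[0],n+1}_{xx} = v^{[0],n}, \quad v^{[0],n+1}(a) = g_0(t^{n+1}), \quad v^{[0],n+1}(b) = g_1(t^{n+1}), \qquad (5)$$

with $v^{[0],0}(x) = u_0(x)$, where $a = 0$ and $b = 1$ are the endpoints of the spatial domain.
With some algebra, a first-order backward Euler discretization
of equation (4) gives the update, $v^{[p+1],n+1}$, as

$$v^{[p+1],n+1} - \Delta t\, v^{[p+1],n+1}_{xx} = v^{[p+1],n} - \Delta t\, v^{[p],n+1}_{xx} + \int_{t^n}^{t^{n+1}} v^{[p]}_{xx}(\tau,x)\,d\tau, \qquad (6)$$

with $v^{[p+1],n+1}(a) = g_0(t^{n+1})$ and $v^{[p+1],n+1}(b) = g_1(t^{n+1})$. The integral in
equation (6) is approximated using a sufficiently high-order quadrature rule [8].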

Parallelism in time is possible because the PDE of interest (1) and the error
PDEs (4) can be solved simultaneously, after initial startup costs. The idea is to fill
out the memory footprint, which is needed so that the integral in equation (6) can be
approximated by high-order quadrature, before marching solutions to (5) and (6) in
a pipeline fashion. See Figure 1 for a graphical example, and [6] for more details.
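
The predictor and correction updates can be tried out on a scalar analogue. The sketch below is an assumption-laden illustration, not the RIDC-DD code: it replaces the operator $\partial_{xx}$ by multiplication with a constant `lam` (so the model problem is $y' = \lambda y$), uses a single correction level, approximates the integral with the trapezoid rule, and runs serially instead of in a pipeline. The function name is hypothetical.

```python
from math import exp


def predictor_corrector(lam, y0, T, n_steps):
    """Return (prediction, corrected) approximations of y(T) for y' = lam * y."""
    dt = T / n_steps
    v0 = v1 = y0
    for _ in range(n_steps):
        # Scalar analogue of the backward Euler predictor (5).
        v0_new = v0 / (1.0 - lam * dt)
        # Trapezoid approximation of the integral of lam * v0 over one step.
        integral = 0.5 * dt * lam * (v0 + v0_new)
        # Scalar analogue of the backward Euler correction (6).
        v1 = (v1 - dt * lam * v0_new + integral) / (1.0 - lam * dt)
        v0 = v0_new
    return v0, v1


exact = exp(-1.0)
for n in (50, 100):
    pred, corr = predictor_corrector(-1.0, 1.0, 1.0, n)
    print(n, abs(pred - exact), abs(corr - exact))
```

Halving the step size roughly halves the predictor error and quarters the corrected error, consistent with one correction raising the order from one to two.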


[Figure 1: pipeline diagram. Rows, bottom to top: original PDE for $v^{[0]}(t,x)$; error PDE, 1st correction; error PDE, 2nd correction; error PDE, 3rd correction. Horizontal axis: time levels $t^{n-3}$, $t^{n-2}$, $t^{n-1}$, $t^{n}$, $t^{n+1}$.]

Fig. 1 The black dots represent the memory footprint that must be stored before the white dots can
be computed in a pipeline. In this figure, $v^{[0],n+2}(x)$, $v^{[1],n+1}(x)$, $v^{[2],n}(x)$ and $v^{[3],n-1}(x)$ are computed
simultaneously.


2.2  RIDC-DD 

The RIDC-DD algorithm solves the predictor (5) and corrections (6) using DD
algorithms in space. The key observation is that (5) and (6) are both elliptic PDEs of
the form $(1 - \Delta t\, \partial_{xx}) z = f(x)$. The function $f(x)$ is known from the solution at the
previous time step and previously computed lower-order approximations. DD
algorithms for solving elliptic PDEs are well known [3, 4]. The general idea is to replace
the PDE by a coupled system of PDEs over some partitioning of the spatial domain
using overlapping or non-overlapping subdomains. The coupling is provided by
necessary transmission conditions at the subdomain boundaries. These transmission
conditions are chosen to ensure the DD algorithm converges and to optimize the
convergence rate. In [5], as a proof of principle, (5)-(6) are solved using a classical
parallel Schwarz algorithm, with overlapping subdomains and Dirichlet transmission
conditions. Optimized RIDC-DD variants are possible using an optimized Schwarz
DD method [13, 12, 9] to solve (5)-(6). The solution from the previous time step can
be used as initial subdomain solutions at the interfaces. We will use RIDCp-DD to
refer to a $p$th-order solution obtained using $p - 1$ RIDC corrections in time and DD
in space.
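
A classical parallel Schwarz iteration for one such elliptic solve can be sketched in a few lines. The example below is illustrative only and makes several assumptions not taken from [5]: one spatial dimension, two overlapping subdomains, homogeneous Dirichlet boundary data, centered finite differences with `r` standing for $\Delta t / h^2$, and a direct tridiagonal (Thomas) solve on each subdomain. All names are hypothetical.

```python
def solve_subdomain(f, r, lo, hi, left, right):
    """Solve -r*z[i-1] + (1+2r)*z[i] - r*z[i+1] = f[i] for i = lo..hi,
    with Dirichlet values z[lo-1] = left and z[hi+1] = right (Thomas algorithm)."""
    n = hi - lo + 1
    diag = 1.0 + 2.0 * r
    rhs = [f[i] for i in range(lo, hi + 1)]
    rhs[0] += r * left
    rhs[-1] += r * right
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = -r / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag + r * c[i - 1]
        c[i] = -r / denom
        d[i] = (rhs[i] + r * d[i - 1]) / denom
    z = [0.0] * n
    z[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        z[i] = d[i] - c[i] * z[i + 1]
    return z


def parallel_schwarz(f, r, overlap, tol=1e-12, max_iter=500):
    """Two-subdomain parallel Schwarz with Dirichlet transmission conditions."""
    m = len(f)
    mid = m // 2
    hi1 = mid + overlap - 1  # subdomain 1 owns indices 0..hi1
    lo2 = mid - overlap      # subdomain 2 owns indices lo2..m-1
    z = [0.0] * m            # initial guess
    for it in range(1, max_iter + 1):
        # Both solves use only the previous iterate, so they can run in parallel.
        z1 = solve_subdomain(f, r, 0, hi1, 0.0, z[hi1 + 1])
        z2 = solve_subdomain(f, r, lo2, m - 1, z[lo2 - 1], 0.0)
        new = z1[:mid] + z2[mid - lo2:]  # glue the two subdomain solutions
        change = max(abs(a - b) for a, b in zip(new, z))
        z = new
        if change < tol:
            return z, it
    return z, max_iter


m, dt = 39, 0.01
r = dt * (m + 1) ** 2
f = [1.0] * m
z, iterations = parallel_schwarz(f, r, overlap=4)
print(iterations)
```

In a time-stepping code the initial guess `z` would be the solution from the previous time step rather than zero, which is the reuse of interface data mentioned above.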


3  Implementation  Details 

We view the parallel time integrator reviewed in §2.1 as a simple yet powerful tool to
add further scalability to a legacy MPI or modern MPI-CUDA code, while improving
the accuracy of the numerical solution. The RIDC integrators benefit from access to
shared memory because solving the correction PDE (6) requires both the solution
from the previous time step and previously computed lower-order subdomain solutions.
Consequently, we propose two MPI-OpenMP hybrid implementations which
map well to multi-core, multi-node compute resources. In the upcoming MPI 3.0
standard [1], shared memory access within the MPI library will provide alternative
implementations.

Implementation #1: The RIDC-DD algorithm can be implemented using a traditional
fork-join approach, as illustrated in Program 1. After boundary information
is exchanged, each MPI task spawns OpenMP threads to perform the linear solve.
The threads are merged back together before MPI communication is used to check
for convergence. The drawback to this fork-join implementation is that the parallel
space-time algorithm becomes tightly integrated, making it difficult to leverage an
existing spatially parallel DD implementation.


 1.  MPI Initialization
 2.
 3.  for each time step
 4.      for each Schwarz iteration
 5.          MPI Comm (exchange boundary info)
 6.          OMP Parallel for each prediction/correction
 7.              linear solve
 8.          end parallel
 9.          MPI Comm (check for convergence)
10.      end
11.  end
12.  ...
13.  MPI Finalize


Program  1:  RIDC-DD  implementation  using  a  fork-join  approach.  The  time  parallelism  occurs 
within  each  Schwarz  iteration,  requiring  a  tight  integration  with  an  existing  spatially  parallel  DD 
implementation. 
Implementation #2: To leverage an existing spatially parallel DD implementation,
a non-traditional hybrid approach must be considered. By changing the order
of the loops, the Schwarz iterations for the prediction and the correction loops
can be evaluated independently of each other. This is realized by spawning individual
OpenMP threads to solve the prediction and correction loops on each subdomain;
the Schwarz iterations for the prediction/correction step run independently
of each other until convergence. This implementation (Program 2) has several
consequences: (i) a thread-safe version of MPI supporting MPI_THREAD_MULTIPLE
is required; (ii) in addition, we required thread-safe, thread-independent versions
of MPI_BARRIER, MPI_BROADCAST and MPI_GATHER. To achieve this, we
wrote our own wrapper library using the thread-safe MPI_SEND, MPI_RECV and
MPI_SENDRECV provided by (i).


 1.  MPI Initialization
 2.
 3.  for each time step
 4.      OMP Parallel for each prediction/correction level
 5.          for each Schwarz iteration
 6.              MPI Comm (exchange boundary info)
 7.              linear solve
 8.              MPI Comm (check for convergence)
 9.          end
10.      end parallel
11.  end
12.  ...
13.  MPI Finalize


Program 2: RIDC-DD implementation using a non-traditional hybrid approach. Notice that lines
5-9 are the Schwarz iterations that one would find in an existing spatially parallel DD implementation.
Hence, provided the DD implementation is thread-safe, one could wrap the time parallelization
around an existing parallel DD implementation.
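
The difference in loop ordering between the two programs can be mimicked on a single machine with Python threads. This is only a structural sketch: there is no MPI or OpenMP here, the function `sweep` is a hypothetical stand-in for "exchange boundary info, then linear solve" (a simple contraction whose fixed point equals the level index), and the comments mark where the MPI calls of Programs 1 and 2 would sit.

```python
from concurrent.futures import ThreadPoolExecutor


def sweep(level, x):
    """Hypothetical stand-in for one Schwarz sweep on one prediction/correction level."""
    return 0.5 * x + 0.5 * level


def fork_join_step(levels, tol=1e-12):
    """Program 1 ordering: threads are forked and joined inside each iteration."""
    x = [0.0] * levels
    sweeps = 0
    with ThreadPoolExecutor(max_workers=levels) as pool:
        while True:
            # (boundary exchange would happen here, outside the threaded region)
            new = list(pool.map(sweep, range(levels), x))  # fork ... join
            sweeps += 1
            # (global convergence check would happen here, after the join)
            done = max(abs(a - b) for a, b in zip(new, x)) < tol
            x = new
            if done:
                return x, sweeps


def threaded_step(levels, tol=1e-12):
    """Program 2 ordering: each thread runs its own complete iteration loop."""
    def run_level(level):
        x = 0.0
        while True:
            # exchange, solve and convergence check all happen inside the thread
            new = sweep(level, x)
            if abs(new - x) < tol:
                return new
            x = new

    with ThreadPoolExecutor(max_workers=levels) as pool:
        return list(pool.map(run_level, range(levels)))


print(fork_join_step(4)[0])
print(threaded_step(4))
```

In the second ordering each level stops as soon as it has converged, but every communication call is issued from inside a thread, which is what forces the thread-safety requirements listed for Implementation #2.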


4  Numerical  Experiments 

We show first that RIDC-DD methods converge with the designed orders in space
and time. Then, we profile communication costs using TAU [14]. Finally, we show
strong scaling studies for the RIDC-DD algorithm. We compute solutions to the
heat equation in $\mathbb{R}^2$, where centered finite differences are used to approximate the
second derivative operator. Errors are computed using the known analytic solution.
The computations are performed at the High Performance Computing Center at
Michigan State University, where nodes (consisting of two quad-core Intel Westmere
processors) are interconnected using InfiniBand and a high-speed Lustre file system.
4.1  Convergence  Studies  and  Profile  Analysis

In Figure 2, the convergence plots show that our classical Schwarz RIDC-DD
algorithm converges as expected in space and time. In general, one would balance
the orders of the errors in space and time appropriately for efficiency. Here we pick
RIDC4 because it mapped well to our available four-core sockets and because it
demonstrates the scalability of our algorithm in time. We could, of course, have used a
fourth-order method in space. The Schwarz iterations are iterated until a tolerance of $10^{-12}$
is reached for the predictors and correctors (which explains why the error in the
fourth-order approximation levels out as the time step becomes small).


[Figure 2: two convergence plots. (a) Time Convergence (horizontal axis: dt); (b) Space Convergence.]

Fig. 2 (a) Classical Schwarz RIDCp-DD algorithms, $p = 1, 2, 3, 4$, converge to the reference
solution with the designed orders of accuracy. Here $\Delta x$ is fixed while $\Delta t$ is varied. (b) Second-order
convergence in space is demonstrated for the fourth-order RIDC-DD algorithm. Here, $\Delta t$ is fixed
while $\Delta x$ is varied.


The communication costs for our two implementations of RIDC4-DD are profiled
using TAU [14]. We see in Figure 3 that communication costs are minimal for
implementation #1 and scale nicely as the number of nodes is increased, but the
communication cost is significant for implementation #2. In Figure 3(a,c), the domain
is discretized into 180 × 180 grid nodes, which are split into a 3 × 3 configuration
of subdomains. In Figure 3(b,d), the domain is discretized into 360 × 360
grid nodes, which are split into a 6 × 6 configuration of subdomains. This keeps the
number of grid points per subdomain constant so that the computation time for the
matrix factorization and linear solve are the same.


4.2  Characterizing  Parallel  Performance 

Due to the better communication profile, we use framework #1 for our experiments.
We fix $\Delta x$, $\Delta y$ and $\Delta t$, and set TOL $= 10^{-12}$ (the Schwarz iteration tolerance).
We consider three configurations of subdomains: 2 × 2, 4 × 4 and 6 × 6. For
each configuration we illustrate the speedup and efficiency due to the time parallelism
in Figure 4. We choose to fix the ratio between the overlap and subdomain
size to ensure the number of unknowns on each subdomain scales appropriately as
the number of subdomains is increased.

[Figure 3, panels: (a) Implementation #1, 3 × 3 domain; (b) Implementation #1, 6 × 6 domain; (c) Implementation #2, 3 × 3 domain; (d) Implementation #2, 6 × 6 domain. Label: Overhead.]

Fig. 3 Profile of the RIDC4-DD algorithm using both implementations. Overhead and communication
costs are reasonable for implementation #1, but are high for implementation #2.

In Figure 4 we show three curves corresponding to a 2 × 2, 4 × 4 and a 6 × 6
configuration of subdomains. For each configuration we compute a fourth-order solution
in time using 1, 2 and 4 threads. The 6 × 6 configuration of subdomains with 4 threads
uses a total of 144 cores. We plot the efficiency (with respect to the one-thread run)
as a function of the number of threads. Speedup is evident as temporal parallelization
is improved; however, efficiency decreases as the number of subdomains increases.


Fig.  4  Scaling  study  (in  time)  for  a 
RIDC4-DD  algorithm. 
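
The quantities plotted in the scaling study follow from simple arithmetic. The sketch below uses the usual definitions of speedup and parallel efficiency relative to the one-thread run; the timings are hypothetical placeholders, not measurements from the experiments, and the core count assumes one MPI task per subdomain and one core per thread.

```python
def total_cores(subdomains_x, subdomains_y, threads):
    """Cores used when each subdomain is one MPI task running `threads` threads."""
    return subdomains_x * subdomains_y * threads


def speedup_and_efficiency(t_one_thread, t_n_threads, n_threads):
    """Speedup and efficiency of an n-thread run relative to the one-thread run."""
    speedup = t_one_thread / t_n_threads
    return speedup, speedup / n_threads


print(total_cores(6, 6, 4))  # 144, matching the 6 x 6 configuration with 4 threads

# Hypothetical wall-clock times in seconds for 1 and 4 threads.
print(speedup_and_efficiency(100.0, 30.0, 4))
```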


5  Conclusions 

This paper has presented the implementation details and first reported profiling results
for a newly proposed space-time parallel algorithm for time-dependent PDEs.
The RIDC-DD method combines traditional domain decomposition in space with
a new family of deferred correction methods designed to allow parallelism in time.
Two possible implementations are described and profiled. The first, a traditional hybrid
OpenMP-MPI implementation, requires potentially difficult modifications of
an existing parallel spatial solver. Numerical experiments verify that the algorithm
achieves its designed order of accuracy and scales well. The second strategy allows
a relatively easy reuse of an existing parallel spatial solver by using OpenMP
to spawn threads for the simultaneous prediction and correction steps. This non-traditional
hybrid use of OpenMP and MPI currently requires writing of custom
thread-safe and thread-independent MPI routines. Profile analysis shows that our
non-traditional use of OpenMP-MPI suffers from higher communication costs than
the standard use of OpenMP-MPI. An inspection of the prediction and correction
equations indicates that optimized Schwarz variants of the algorithm are possible
and will enjoy nice load balancing. This work is ongoing.

Ronald D. Haynes and Benjamin W. Ong

DISTRIBUTION A: Distribution approved for public release.